Analysing IMDb Movies Ratings

1. Problem Statement

In recent years, there has been a rush to create more digital content to capture user attention. Streaming services like Netflix, Disney+ and Apple TV+ are creating their own content that could rival that of existing film production companies. With more digital content and content producers, competition between film companies for viewership is heating up. Hence, it is becoming increasingly urgent and important to understand the factors that result in successful movies.

To understand these factors, our team has decided to analyse IMDb’s official dataset (IMDb, n.d.), given that the company does not permit any form of web scraping on its site. The dataset contains over 100,000 movies released since 1900 and provides a variety of information such as ratings, number of votes, genre, duration and release year.

Based on research, it was only in the late 1980s that the modern trend of movies earning a higher proportion of their box office abroad than from domestic ticket sales emerged (Hall & Neale, 2010), suggesting that more viewers watched a movie internationally than in its country of production. Hence, to provide an analysis of movies that are more likely to suit global audiences, our team has decided to restrict the analysis to movies released from 1990 to 2019.

Our project is exploratory in nature and it aims to:

  1. Examine whether genre, movie duration and year of release affect the success (ratings) of movies released from 1990 to 2019.

  2. Create a model to measure the effect of genre on the ratings of movies.

  3. If genre is found to affect ratings in point 2, check whether movie duration has any effect on genre and ratings.

For this data science project, we have loaded the following packages:

library(tidyverse)
library(lubridate)
library(gridExtra)
library(grid)
library(RColorBrewer)
library(wordcloud2)
library(broom.mixed)
library(haven) 
library(broom)
library(gvlma) 
library(jtools)
library(huxtable) 
library(ggfortify)
library(knitr)
library(kableExtra)
library(ggrepel)
library(ggthemes)
library(sandwich) 
library(psych)
library(Hmisc) 
library(ggcorrplot)
library(ggstance)
library(car)
library(plotly)
library(gtrendsR)
library(ggiraph)
library(interactions)
library(shiny)
library(ggwordcloud)
library(httr)
library(jsonlite)
library(textdata)
library(tibble)
library(here)
library(rvest)
library(glue)
library(tidytext)

options(shiny.sanitize.errors = TRUE)
options(scipen = 9999)

2. Import

Here, we import the IMDb official datasets. The datasets were downloaded from https://datasets.imdbws.com/

title_basics <- read_tsv("title_basics.tsv")
ratings <- read_tsv("ratings.tsv")
glimpse(title_basics)
## Rows: 5,826,089
## Columns: 9
## $ tconst         <chr> "tt0000001", "tt0000002", "tt0000003", "tt0000004", "t…
## $ titleType      <chr> "short", "short", "short", "short", "short", "short", …
## $ primaryTitle   <chr> "Carmencita", "Le clown et ses chiens", "Pauvre Pierro…
## $ originalTitle  <chr> "Carmencita", "Le clown et ses chiens", "Pauvre Pierro…
## $ isAdult        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ startYear      <dbl> 1894, 1892, 1892, 1892, 1893, 1894, 1894, 1894, 1894, …
## $ endYear        <chr> "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "\\N"…
## $ runtimeMinutes <chr> "1", "5", "4", "12", "1", "1", "1", "1", "45", "1", "1…
## $ genres         <chr> "Documentary,Short", "Animation,Short", "Animation,Com…
glimpse(ratings)
## Rows: 1,086,195
## Columns: 3
## $ tconst        <chr> "tt0000001", "tt0000002", "tt0000003", "tt0000004", "tt…
## $ averageRating <dbl> 5.6, 6.1, 6.5, 6.2, 6.2, 5.3, 5.4, 5.4, 5.9, 6.9, 5.2, …
## $ numVotes      <dbl> 1656, 201, 1368, 122, 2150, 115, 661, 1820, 155, 6074, …
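
The glimpse output above shows runtimeMinutes read as character and endYear filled with \N strings, which is the marker IMDb uses for missing fields. The following is a minimal sketch, assuming the same tab-separated layout, of declaring \N as the missing-value marker at import time so that such columns parse as numbers with proper NAs. The toy rows are illustrative, and base R's read.delim is swapped in only to keep the example self-contained; read_tsv accepts an equivalent na argument.

```r
# Toy tab-separated text mimicking title_basics (illustrative rows only).
# "\\N" in R source is the two-character field \N used for missing values.
tsv_text <- paste(
  "tconst\tstartYear\truntimeMinutes\tgenres",
  "tt0000001\t1894\t1\tDocumentary,Short",
  "tt0000002\t1892\t\\N\t\\N",
  sep = "\n"
)

# Treat \N as NA while reading; quote = "" keeps apostrophes in titles intact
toy_basics <- read.delim(
  text = tsv_text,
  na.strings = "\\N",
  quote = "",
  stringsAsFactors = FALSE
)

str(toy_basics)

# runtimeMinutes is now numeric with an NA, so later filters can use is.na()
sum(is.na(toy_basics$runtimeMinutes))
```

With this approach the later "\\N" string comparisons would become is.na() checks, and the as.numeric conversion of runtimeMinutes would no longer be needed.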

2.1 Variables

The title_basics dataset consists of 9 variables and 5,826,089 observations, and the ratings dataset consists of 3 variables and 1,086,195 observations.
The variables that will be selected in this study are:

  • Title of the movies (primaryTitle)
  • Release year of the movies (startYear)
  • Duration of the movies (runtimeMinutes)
  • Genres of the movies (genres)
  • Weighted average rating of the movies (averageRating)
  • Number of votes the movies received (numVotes)

3. Tidy & Transform

3.1 Merging of datasets

For our analysis, the two datasets are merged with inner_join on the unique identifier tconst, which is the title ID of each movie. An inner join keeps only the rows whose tconst appears in both tables, so titles that have no rating (and ratings that have no matching title) are dropped.

movies_dataset <- inner_join(title_basics, ratings, by = "tconst", copy = FALSE, suffix = c(".title_basics", ".ratings"))

glimpse(movies_dataset)
## Rows: 859,290
## Columns: 11
## $ tconst         <chr> "tt0000001", "tt0000002", "tt0000003", "tt0000004", "t…
## $ titleType      <chr> "short", "short", "short", "short", "short", "short", …
## $ primaryTitle   <chr> "Carmencita", "Le clown et ses chiens", "Pauvre Pierro…
## $ originalTitle  <chr> "Carmencita", "Le clown et ses chiens", "Pauvre Pierro…
## $ isAdult        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ startYear      <dbl> 1894, 1892, 1892, 1892, 1893, 1894, 1894, 1894, 1894, …
## $ endYear        <chr> "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "\\N"…
## $ runtimeMinutes <chr> "1", "5", "4", "12", "1", "1", "1", "1", "45", "1", "1…
## $ genres         <chr> "Documentary,Short", "Animation,Short", "Animation,Com…
## $ averageRating  <dbl> 5.6, 6.1, 6.5, 6.2, 6.2, 5.3, 5.4, 5.4, 5.9, 6.9, 5.2,…
## $ numVotes       <dbl> 1656, 201, 1368, 122, 2150, 115, 661, 1820, 155, 6074,…
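
The row counts above (859,290 joined rows against 1,086,195 rating rows) show that the join drops unmatched records. The following is a minimal sketch, on hypothetical toy tables, of how inner_join behaves and how anti_join can list what was dropped from each side.

```r
library(dplyr)

# Toy tables (illustrative IDs): one title has no rating, one rating has no title
toy_titles  <- tibble(tconst = c("tt1", "tt2", "tt3"),
                      primaryTitle = c("A", "B", "C"))
toy_ratings <- tibble(tconst = c("tt2", "tt3", "tt4"),
                      averageRating = c(6.1, 7.3, 5.0))

# inner_join keeps only the IDs present in both tables
joined <- inner_join(toy_titles, toy_ratings, by = "tconst")

# anti_join lists what the inner join dropped from each side
unrated_titles   <- anti_join(toy_titles, toy_ratings, by = "tconst")
untitled_ratings <- anti_join(toy_ratings, toy_titles, by = "tconst")

nrow(joined)             # rows with a match on both sides
unrated_titles$tconst    # titles without a rating
untitled_ratings$tconst  # ratings without a title
```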

3.2 Filtering unnecessary variables and data

Firstly, we narrow down the release year of the movies to 1990 to 2019 inclusive. This period was selected as it was around this time that the foreign box office began to outpace the domestic box office of movies (Hall & Neale, 2010). In addition, given that this project was done in 2020, the data for that year is incomplete.

Secondly, as the IMDb dataset includes different types of content (titleType), such as TV series and video games, only movie and tvMovie titles were retained, since this study is specific to movies.

Thirdly, we have removed the following variables/columns that are not relevant to our analysis:

  • originalTitle - Title of the movie in its original language
  • isAdult - Whether the title is labelled as adult content
  • endYear - The end year of a TV series (\N for all other title types, including movies)

movies_dataset <- movies_dataset %>% 
  filter(titleType %in% c("movie", "tvMovie")) %>% 
  select(-originalTitle, -isAdult, -endYear)

movies_dataset <- movies_dataset %>% 
  filter(startYear %in% c(1990:2019))

glimpse(movies_dataset)
## Rows: 156,450
## Columns: 8
## $ tconst         <chr> "tt0015414", "tt0015724", "tt0016906", "tt0077432", "t…
## $ titleType      <chr> "movie", "movie", "movie", "movie", "movie", "movie", …
## $ primaryTitle   <chr> "La tierra de los toros", "Dama de noche", "Frivolinas…
## $ startYear      <dbl> 2000, 1993, 2014, 1991, 1993, 1990, 2011, 1995, 1991, …
## $ runtimeMinutes <chr> "60", "102", "80", "85", "94", "100", "122", "73", "10…
## $ genres         <chr> "\\N", "Drama,Mystery,Romance", "Comedy,Musical", "Act…
## $ averageRating  <dbl> 5.4, 6.2, 5.6, 6.7, 5.4, 6.3, 7.1, 6.9, 7.3, 5.8, 7.3,…
## $ numVotes       <dbl> 11, 20, 15, 6, 244, 289, 247, 303, 39, 12, 201, 109, 1…

3.3 Removing empty values in the dataset

Now, we check for the presence of \N values, which IMDb uses to mark missing data, so that they can be removed from the dataset.

map(movies_dataset, ~sum("\\N" %in% .))
## $tconst
## [1] 0
## 
## $titleType
## [1] 0
## 
## $primaryTitle
## [1] 0
## 
## $startYear
## [1] 0
## 
## $runtimeMinutes
## [1] 1
## 
## $genres
## [1] 1
## 
## $averageRating
## [1] 0
## 
## $numVotes
## [1] 0

From the above results, \N is present (a value of 1) under the variables runtimeMinutes and genres. Therefore, we removed the observations containing \N in these two variables.

movies_dataset <- movies_dataset %>% 
  filter(runtimeMinutes != "\\N", genres != "\\N")

glimpse(movies_dataset)
## Rows: 134,952
## Columns: 8
## $ tconst         <chr> "tt0015724", "tt0016906", "tt0077432", "tt0081145", "t…
## $ titleType      <chr> "movie", "movie", "movie", "movie", "movie", "movie", …
## $ primaryTitle   <chr> "Dama de noche", "Frivolinas", "Bloody Hero", "Me and …
## $ startYear      <dbl> 1993, 2014, 1991, 1993, 1990, 2011, 1995, 1991, 1993, …
## $ runtimeMinutes <chr> "102", "80", "85", "94", "100", "122", "73", "101", "9…
## $ genres         <chr> "Drama,Mystery,Romance", "Comedy,Musical", "Action,Dra…
## $ averageRating  <dbl> 6.2, 5.6, 6.7, 5.4, 6.3, 7.1, 6.9, 7.3, 5.8, 7.3, 6.5,…
## $ numVotes       <dbl> 20, 15, 6, 244, 289, 247, 303, 39, 12, 201, 109, 12, 8…
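
The check in this section reports only whether \N is present in each column (0 or 1), not how many rows are affected. The following is a minimal sketch, on a hypothetical toy data frame, of counting the markers per column before filtering.

```r
# Toy data frame with IMDb-style "\N" markers (illustrative values only)
toy <- data.frame(
  runtimeMinutes = c("60", "\\N", "85", "\\N"),
  genres = c("\\N", "Drama", "Comedy,Musical", "Action"),
  stringsAsFactors = FALSE
)

# Number of "\N" entries in each column (a count, not just presence)
missing_counts <- vapply(toy, function(col) sum(col == "\\N"), integer(1))
missing_counts

# Keep only rows where neither column holds the marker
toy_complete <- toy[toy$runtimeMinutes != "\\N" & toy$genres != "\\N", ]
nrow(toy_complete)
```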

3.4 Binning of release year of movies

Next, to see if there are any trends in different time periods, the release year of the movies was binned into 5-year groups. The bins of the release years are as follows:

  • 1990: 1990 to 1994
  • 1995: 1995 to 1999
  • 2000: 2000 to 2004
  • 2005: 2005 to 2009
  • 2010: 2010 to 2014
  • 2015: 2015 to 2019

movies_dataset <- movies_dataset %>% 
  mutate(Five_Year =
           ifelse(startYear >= 1990 & startYear < 1995, "1990",
                  ifelse(startYear >= 1995 & startYear < 2000, "1995",
                         ifelse(startYear >= 2000 & startYear < 2005, "2000",
                                ifelse(startYear >= 2005 & startYear < 2010, "2005",
                                       ifelse(startYear >= 2010 & startYear < 2015, "2010",
                                              ifelse(startYear >= 2015, "2015", NA_real_)))))))

glimpse(movies_dataset)
## Rows: 134,952
## Columns: 9
## $ tconst         <chr> "tt0015724", "tt0016906", "tt0077432", "tt0081145", "t…
## $ titleType      <chr> "movie", "movie", "movie", "movie", "movie", "movie", …
## $ primaryTitle   <chr> "Dama de noche", "Frivolinas", "Bloody Hero", "Me and …
## $ startYear      <dbl> 1993, 2014, 1991, 1993, 1990, 2011, 1995, 1991, 1993, …
## $ runtimeMinutes <chr> "102", "80", "85", "94", "100", "122", "73", "101", "9…
## $ genres         <chr> "Drama,Mystery,Romance", "Comedy,Musical", "Action,Dra…
## $ averageRating  <dbl> 6.2, 5.6, 6.7, 5.4, 6.3, 7.1, 6.9, 7.3, 5.8, 7.3, 6.5,…
## $ numVotes       <dbl> 20, 15, 6, 244, 289, 247, 303, 39, 12, 201, 109, 12, 8…
## $ Five_Year      <chr> "1990", "2010", "1990", "1990", "1990", "2010", "1995"…
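
Because the bins are evenly spaced, the nested ifelse calls can also be expressed with integer arithmetic. The following is a minimal sketch of that alternative on illustrative years; it is not the code used to build the dataset.

```r
# Illustrative release years spanning the study period
start_year <- c(1990, 1994, 1995, 2004, 2014, 2015, 2019)

# Integer division by 5 gives the bin index; multiplying back gives the bin's first year
five_year <- as.character((start_year %/% 5) * 5)
five_year
```

This relies on every bin starting on a multiple of 5, which holds for the six bins listed above.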

3.5 Selection of genre

In the genres column, each movie has 1 to 3 genres, listed without any ranking of importance (they appear alphabetically in the examples above). Hence, the team wrote a function that randomly selects one of the genres as the Final_Genre variable, which is used for further analysis.

generate_genre <- function(text) {
  splitted <- strsplit(text, ",")
  no_of_genre <- length(unlist(splitted))
  random_number <- sample(1:no_of_genre, 1)
  result <- unlist(splitted)[random_number]
  return(result)
}

movies_dataset <- movies_dataset %>% 
  rowwise() %>% 
  mutate(Final_Genre = generate_genre(genres))

glimpse(movies_dataset)
## Rows: 134,952
## Columns: 10
## Rowwise: 
## $ tconst         <chr> "tt0015724", "tt0016906", "tt0077432", "tt0081145", "t…
## $ titleType      <chr> "movie", "movie", "movie", "movie", "movie", "movie", …
## $ primaryTitle   <chr> "Dama de noche", "Frivolinas", "Bloody Hero", "Me and …
## $ startYear      <dbl> 1993, 2014, 1991, 1993, 1990, 2011, 1995, 1991, 1993, …
## $ runtimeMinutes <chr> "102", "80", "85", "94", "100", "122", "73", "101", "9…
## $ genres         <chr> "Drama,Mystery,Romance", "Comedy,Musical", "Action,Dra…
## $ averageRating  <dbl> 6.2, 5.6, 6.7, 5.4, 6.3, 7.1, 6.9, 7.3, 5.8, 7.3, 6.5,…
## $ numVotes       <dbl> 20, 15, 6, 244, 289, 247, 303, 39, 12, 201, 109, 12, 8…
## $ Five_Year      <chr> "1990", "2010", "1990", "1990", "1990", "2010", "1995"…
## $ Final_Genre    <chr> "Mystery", "Comedy", "Drama", "Comedy", "Drama", "Come…
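
Since the selection is random, the result changes between runs unless the random number generator is seeded. The following is a minimal sketch of a reproducible, vectorised variant; the helper name pick_one_genre and the seed value are hypothetical choices made for illustration.

```r
# Pick one genre per title at random; vectorised over a character vector
pick_one_genre <- function(genres) {
  parts <- strsplit(genres, ",", fixed = TRUE)
  vapply(parts, function(p) p[sample.int(length(p), 1)], character(1))
}

toy_genres <- c("Drama,Mystery,Romance", "Comedy,Musical", "Horror")

set.seed(2020)  # arbitrary seed, chosen only for illustration
first_run <- pick_one_genre(toy_genres)

set.seed(2020)  # same seed again
second_run <- pick_one_genre(toy_genres)

identical(first_run, second_run)  # the two runs agree
```

Because the helper takes a whole column, it could be used inside mutate without rowwise().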

3.6 Removing movies with fewer than 1,000 votes

The team has decided to analyse only movies with at least 1,000 votes to ensure that the ratings are representative and reliable.

movies_dataset <- movies_dataset %>% 
  filter(numVotes >= 1000)

3.7 Removing outliers in average rating

As averageRating is the dependent variable for this project, the team has decided to check for the presence of outliers.

checking_rating_outlier <- movies_dataset %>% 
  ggplot(aes(y = averageRating)) +
  geom_boxplot() +
  labs(title = "IMDb Movies Rating Distribution",
       subtitle = "Presence of extreme outliers",
       caption = "Source: IMDb",
       y = "Rating") + 
  theme_linedraw()

ggplotly(checking_rating_outlier)

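From the boxplot, extreme rating outliers can be observed where the rating is below 3.3. Hence, the outliers are removed using the outlier function provided by Assistant Professor S. Roh, which sets to NA any value lying more than 1.5 times the interquartile range below the first quartile or above the third quartile.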

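remove_outliers <- function(x, na.rm = TRUE, ...) {
  # First and third quartiles of x
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  # Whisker length: 1.5 times the interquartile range
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  # Values beyond the whiskers are replaced with NA
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}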

movies_dataset$averageRating_cleaned <- remove_outliers(movies_dataset$averageRating)

movies_cleaned_dataset <- movies_dataset %>% drop_na() %>% select(-averageRating_cleaned)

To check if the outliers have been removed, boxplots of the movie ratings before and after the removal of outliers are shown side by side. Visually, it appears that most outliers have been removed.

removed_rating_outlier <- movies_cleaned_dataset %>% 
  ggplot(aes(y = averageRating)) +
  geom_boxplot() +
  labs(title = "IMDb Movies Rating Distribution",
       subtitle = "Removal of extreme outliers",
       caption = "Source: IMDb",
       y = "Rating") +
  theme_linedraw()

grid.arrange(checking_rating_outlier, removed_rating_outlier, ncol = 2)
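
To make the 1.5 × IQR rule used in this section concrete, the following is a small worked sketch on illustrative ratings (not taken from the dataset), showing the two fences and the values they flag.

```r
# Illustrative ratings with one very low and one very high value
toy_ratings <- c(1.0, 5.5, 6.0, 6.2, 6.5, 6.8, 7.0, 7.4, 9.9)

# Quartiles and the 1.5 * IQR fences
q <- quantile(toy_ratings, probs = c(.25, .75))
iqr <- IQR(toy_ratings)
lower_fence <- unname(q[1] - 1.5 * iqr)
upper_fence <- unname(q[2] + 1.5 * iqr)

# Values outside the fences are the ones the rule flags
flagged <- toy_ratings[toy_ratings < lower_fence | toy_ratings > upper_fence]

c(lower = lower_fence, upper = upper_fence)
flagged
```

Here the quartiles are 6.0 and 7.0, so the fences sit at 4.5 and 8.5 and the ratings 1.0 and 9.9 are flagged.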

3.8 Removing genres with fewer than 100 movies

To ensure that the analysis is reliable and representative, our team has decided to remove genres that have fewer than 100 movies in the 30-year period.

num_movies_per_genre <- movies_cleaned_dataset %>% 
  group_by(Final_Genre) %>% 
  summarise(No_of_Movies = n()) %>% 
  arrange(desc(No_of_Movies))
## `summarise()` ungrouping output (override with `.groups` argument)
knitr::kable(num_movies_per_genre,align = "lc", format = "html") %>%
    kable_styling()
Final_Genre No_of_Movies
Drama 5513
Comedy 3475
Romance 1495
Action 1483
Thriller 1306
Crime 1165
Horror 1018
Documentary 779
Adventure 734
Mystery 542
Biography 459
Fantasy 435
Sci-Fi 378
Family 353
Animation 274
History 263
Music 246
Sport 187
War 140
Musical 104
Western 44
News 6
Adult 1
Short 1

To detect genres with at least 100 movies, we create an empty vector (result_list) to contain those genres.

result_list <- c()

for (i in 1:nrow(num_movies_per_genre)) {
  if (num_movies_per_genre$No_of_Movies[i] >= 100) {
    result_list <- c(num_movies_per_genre$Final_Genre[i], result_list)
  }
}

result_list
##  [1] "Musical"     "War"         "Sport"       "Music"       "History"    
##  [6] "Animation"   "Family"      "Sci-Fi"      "Fantasy"     "Biography"  
## [11] "Mystery"     "Adventure"   "Documentary" "Horror"      "Crime"      
## [16] "Thriller"    "Action"      "Romance"     "Comedy"      "Drama"

To ensure that the final dataset only contains genres with at least 100 movies, the result_list created above is used to filter the movies by Final_Genre.

movies_cleaned_dataset <- movies_cleaned_dataset %>% 
                  filter(Final_Genre %in% result_list)
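
The loop above can also be written without growing a vector element by element. The following is a minimal base-R sketch on illustrative labels, where a threshold of 2 stands in for the 100 used in the text.

```r
# Illustrative genre labels; min_movies = 2 stands in for the threshold of 100
toy_final_genre <- c("Drama", "Drama", "Comedy", "Comedy", "Comedy", "Western", "News")
min_movies <- 2

# table() counts movies per genre; keep the names that reach the threshold
genre_counts <- table(toy_final_genre)
kept_genres <- names(genre_counts)[genre_counts >= min_movies]

# Logical index of the movies whose genre survives the filter
keep_row <- toy_final_genre %in% kept_genres

kept_genres
sum(keep_row)
```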

3.9 Binning of movie duration

The team has binned the duration of the movies into 5 categories based on percentiles. This creates a categorical variable for duration that is useful for further analysis. Before binning, the runtimeMinutes variable is converted from character to numeric.

movies_cleaned_dataset$runtimeMinutes <- as.numeric(movies_cleaned_dataset$runtimeMinutes)

The quantile function was used to identify the movie duration in minutes for each percentile group.

quantiles <- quantile(movies_cleaned_dataset$runtimeMinutes, c(.20, .40, .60, .80, 1))

knitr::kable(quantiles,align = "c", col.names = "Duration of Movie (mins)", format = "html") %>%
  kable_styling()
Duration of Movie (mins)
20% 90
40% 97
60% 105
80% 119
100% 467

Now, it’s time for us to categorise the duration of the movies into five categories:

  • Very Short: Up to the 20th percentile
  • Short: From the 21st to the 40th percentile
  • Just Nice: From the 41st to the 60th percentile
  • Long: From the 61st to the 80th percentile
  • Extremely Long: From the 81st to the 100th percentile

movies_cleaned_dataset <- movies_cleaned_dataset %>% 
  mutate(Types_of_Duration =
           ifelse(runtimeMinutes <= 90, "Very Short",
                  ifelse(runtimeMinutes > 90 & runtimeMinutes <= 97, "Short",
                         ifelse(runtimeMinutes > 97 & runtimeMinutes <= 105, "Just Nice",
                                ifelse(runtimeMinutes > 105 & runtimeMinutes <= 119, "Long",
                                       ifelse(runtimeMinutes > 119 & runtimeMinutes <= 467, "Extremely Long", NA_real_))))))

glimpse(movies_cleaned_dataset)
## Rows: 20,349
## Columns: 11
## $ tconst            <chr> "tt0090665", "tt0094997", "tt0096871", "tt0096875",…
## $ titleType         <chr> "movie", "movie", "movie", "movie", "movie", "movie…
## $ primaryTitle      <chr> "Halfaouine: Boy of the Terraces", "Demonia", "Baby…
## $ startYear         <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1994, 199…
## $ runtimeMinutes    <dbl> 98, 88, 88, 116, 108, 108, 102, 168, 101, 56, 107, …
## $ genres            <chr> "Comedy,Drama", "Horror,Mystery", "Horror", "Action…
## $ averageRating     <dbl> 6.7, 4.5, 6.0, 5.4, 7.3, 7.6, 3.6, 7.2, 5.6, 5.9, 7…
## $ numVotes          <dbl> 1198, 1210, 1512, 3320, 3587, 1220, 1387, 1762, 100…
## $ Five_Year         <chr> "1990", "1990", "1990", "1990", "1990", "1990", "19…
## $ Final_Genre       <chr> "Drama", "Mystery", "Horror", "Action", "Drama", "C…
## $ Types_of_Duration <chr> "Just Nice", "Very Short", "Very Short", "Long", "L…
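
The same binning can be expressed with cut(), which takes the cut-off points directly. The following is a minimal sketch using the percentile values reported above (90, 97, 105 and 119 minutes) on illustrative durations; it is an alternative formulation, not the code used to build the dataset.

```r
# Illustrative durations in minutes, including values that sit on a cut-off
duration <- c(56, 88, 90, 95, 97, 101, 105, 110, 119, 150)

# Cut-offs taken from the quantile table; right = TRUE makes each bin (lower, upper]
breaks <- c(-Inf, 90, 97, 105, 119, Inf)
labels <- c("Very Short", "Short", "Just Nice", "Long", "Extremely Long")

types_of_duration <- cut(duration, breaks = breaks, labels = labels, right = TRUE)

data.frame(duration, types_of_duration)
```

A movie of exactly 90 minutes falls in Very Short and one of exactly 119 minutes falls in Long, matching the <= comparisons in the ifelse version.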

3.10 Removing and Renaming variables

Moving on, we remove irrelevant variables and give the remaining variables more meaningful names.

The title ID tconst and the title type titleType are removed, since neither is necessary for the project’s analysis.

The following variables are renamed:

  • primaryTitle renamed to Movie
  • startYear renamed to Year
  • runtimeMinutes renamed to Duration
  • genres renamed to Genres
  • averageRating renamed to Rating
  • numVotes renamed to Num_of_Votes

movies_cleaned_dataset <- movies_cleaned_dataset %>% 
  select(-tconst, -titleType)

movies_cleaned_dataset <- movies_cleaned_dataset %>% 
  rename(Movie = primaryTitle,
         Year = startYear,
         Duration = runtimeMinutes,
         Genres = genres,
         Rating = averageRating,
         Num_of_Votes = numVotes)

glimpse(movies_cleaned_dataset)
## Rows: 20,349
## Columns: 9
## $ Movie             <chr> "Halfaouine: Boy of the Terraces", "Demonia", "Baby…
## $ Year              <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1994, 199…
## $ Duration          <dbl> 98, 88, 88, 116, 108, 108, 102, 168, 101, 56, 107, …
## $ Genres            <chr> "Comedy,Drama", "Horror,Mystery", "Horror", "Action…
## $ Rating            <dbl> 6.7, 4.5, 6.0, 5.4, 7.3, 7.6, 3.6, 7.2, 5.6, 5.9, 7…
## $ Num_of_Votes      <dbl> 1198, 1210, 1512, 3320, 3587, 1220, 1387, 1762, 100…
## $ Five_Year         <chr> "1990", "1990", "1990", "1990", "1990", "1990", "19…
## $ Final_Genre       <chr> "Drama", "Mystery", "Horror", "Action", "Drama", "C…
## $ Types_of_Duration <chr> "Just Nice", "Very Short", "Very Short", "Long", "L…

3.11 Ensure no null values in the tidied dataset

Finally, let’s ensure that there are no NAs in the cleaned dataset. Great! There are no NAs present in the cleaned dataset.

map(movies_cleaned_dataset, ~sum(is.na(.)))
## $Movie
## [1] 0
## 
## $Year
## [1] 0
## 
## $Duration
## [1] 0
## 
## $Genres
## [1] 0
## 
## $Rating
## [1] 0
## 
## $Num_of_Votes
## [1] 0
## 
## $Five_Year
## [1] 0
## 
## $Final_Genre
## [1] 0
## 
## $Types_of_Duration
## [1] 0

3.12 Export and Import final dataset

Given that the genres are randomly selected (see Section 3.5), the dataset differs slightly each time the R Markdown document is run. Hence, our group has exported one version of the dataset produced by the above tidy process. In this section, we read in that exported dataset, which is used for all further analysis.

#write.csv(movies_cleaned_dataset,"final_imdb_dataset.csv", fileEncoding = "UTF-8", row.names = FALSE)
dataset <- read_csv("final_imdb_dataset.csv")
## Parsed with column specification:
## cols(
##   Movie = col_character(),
##   Year = col_double(),
##   Duration = col_double(),
##   Genres = col_character(),
##   Rating = col_double(),
##   Num_of_Votes = col_double(),
##   Five_Year = col_double(),
##   Final_Genre = col_character(),
##   Types_of_Duration = col_character()
## )
dataset$Final_Genre[dataset$Final_Genre == "Sci-Fi"] <- "SciFi"
glimpse(dataset)
## Rows: 20,253
## Columns: 9
## $ Movie             <chr> "Halfaouine: Boy of the Terraces", "Demonia", "Baby…
## $ Year              <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1994, 199…
## $ Duration          <dbl> 98, 88, 88, 116, 108, 108, 102, 168, 101, 56, 107, …
## $ Genres            <chr> "Comedy,Drama", "Horror,Mystery", "Horror", "Action…
## $ Rating            <dbl> 6.7, 4.5, 6.0, 5.4, 7.3, 7.6, 3.6, 7.2, 5.6, 5.9, 7…
## $ Num_of_Votes      <dbl> 1198, 1207, 1511, 3318, 3583, 1218, 1389, 1761, 100…
## $ Five_Year         <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 199…
## $ Final_Genre       <chr> "Drama", "Mystery", "Horror", "Action", "Comedy", "…
## $ Types_of_Duration <chr> "Just Nice", "Very Short", "Very Short", "Long", "L…
unique_values <- sapply(dataset, function(x) length(unique(x)))
unique_values
##             Movie              Year          Duration            Genres 
##             19459                30               198               586 
##            Rating      Num_of_Votes         Five_Year       Final_Genre 
##                60             11292                 6                19 
## Types_of_Duration 
##                 5

Now, let’s start our Exploratory Data Analysis (EDA)!

4. Exploratory Data Analysis

4.1 Visualisation of Number of Movies per Genre

In the Movie variable, there are 19,459 unique titles. Additionally, in the Final_Genre variable, there are 19 unique categories. From the word cloud, Drama has the most movies while Sport has the fewest.

4.2 Visualisation of Ratings by Genre

4.2.1 Barplot of average Ratings of Genre

Based on the bar chart, the three highest-rated movie genres are Documentary, Biography and War. The three lowest-rated genres are Thriller, Sci-Fi and Horror.
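
The plotting code is not shown in this section. The following is a minimal sketch, on illustrative values rather than the actual dataset, of how the per-genre averages behind such a bar chart could be computed and ranked.

```r
# Toy ratings (illustrative values, not taken from the dataset)
toy <- data.frame(
  Final_Genre = c("Documentary", "Documentary", "Horror", "Horror", "Drama", "Drama"),
  Rating = c(7.2, 7.4, 5.0, 5.6, 6.5, 6.7)
)

# Mean rating per genre, sorted from highest to lowest
genre_means <- sort(tapply(toy$Rating, toy$Final_Genre, mean), decreasing = TRUE)
genre_means

names(genre_means)[1]  # highest-rated genre in the toy data
```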

4.2.2 Boxplot of Ratings of Genre

From the boxplot, three pairs of genres have similar mean ratings and interquartile ranges: Comedy and Mystery, Family and Adventure, and History and Music.

4.2.3 Average Rating of each Genre over the Years

Among the 19 genres, the majority show decreasing average ratings over the years from 1990 to 2019. The only exceptions are Family, Music, Sci-Fi and Sport, whose average ratings have been increasing.

4.3 Visualisation of Ratings by Duration

4.3.1 Scatterplot of Ratings by Duration

Based on the scatter plot, there appears to be a general trend where the longer the movie duration, the higher the movie rating.

4.3.2 Barplot of Ratings by Duration bins

From the bar chart based on the 5 movie duration bins, there is a similar trend where movies with longer duration have higher average rating. Both Very Short and Short duration of movies have similar average rating while the average rating increases from Just Nice to Long to Extremely Long duration of movies.

Bins for Movie duration (minutes):

  • Very Short: <= 90
  • Short: 91 - 97
  • Just Nice: 98 - 105
  • Long: 106 - 119
  • Extremely Long: 120 - 467

4.4 Visualisation of Ratings by Year

4.4.1 Scatterplot of Ratings by Year

Based on the scatter plot, there is a general downward trend in ratings over the years.

4.4.2 Barplot of ratings by year bins

From the bar chart of the six 5-year bins, we can see that the average rating of movies fluctuates slightly over the years. However, there is little to no visible change in ratings across the bins.

5. Model

Model 1: ANOVA
Model 2: Correlation
Model 3: Simple Linear Regression
Model 4: Multiple Linear Regression

There are two forms of Modelling we can do:
1. Modelling for Explanation
2. Modelling for Prediction
Our problem statement requires us to do the first one. We begin with statistical inference to identify which variables are significant, after which we analyse the relationship between the variables and rating.

Model 1: ANOVA

Model 1 consists of two parts: an ANOVA across the 19 genres, followed by a post hoc test for pairwise comparison between genres.

1.1.1 Setting up ANOVA hypothesis

For this model, we want to compare mean ratings across all the genres. We should apply an omnibus test to check whether there is evidence that at least one pair of groups is in fact different, and this is where ANOVA saves the day.

We have 19 unique genres in our dataset with varying means. We assume these genres to have met the three conditions to perform ANOVA.
1. Independence of genres
2. Rating data is normal in each genre
3. Variability of rating across genres is about equal
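
The three conditions are assumed here rather than tested. The following is a minimal sketch, on simulated ratings rather than the IMDb data, of how conditions 2 and 3 could be inspected with base R. Note that shapiro.test accepts at most 5,000 values, so a genre as large as Drama would need a sample or a Q-Q plot instead.

```r
set.seed(1)  # arbitrary seed for a repeatable illustration

# Simulated ratings for three genres (not real IMDb data)
toy <- data.frame(
  Final_Genre = factor(rep(c("Drama", "Horror", "War"), each = 50)),
  Rating = c(rnorm(50, 6.6, 0.9), rnorm(50, 5.3, 1.0), rnorm(50, 7.0, 0.8))
)

# Condition 3: compare the spread of ratings across genres
group_sd <- tapply(toy$Rating, toy$Final_Genre, sd)
sd_ratio <- max(group_sd) / min(group_sd)

# Bartlett's test of equal variances (itself sensitive to non-normality)
variance_test <- bartlett.test(Rating ~ Final_Genre, data = toy)

# Condition 2: Shapiro-Wilk normality test within each genre
normality_p <- tapply(toy$Rating, toy$Final_Genre,
                      function(r) shapiro.test(r)$p.value)

round(group_sd, 2)
sd_ratio
variance_test$p.value
normality_p
```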

With these conditions met, we set our hypotheses as follows:

\[ \begin{aligned} H_0&: \mu_1 = \mu_2 = \dots = \mu_{19}, \ \text{where } \mu_i \text{ is the mean rating of genre } i \text{ (the mean ratings of all genres are equal).} \\ H_1&: \text{At least one mean is different.} \end{aligned} \]

Strong evidence favoring the alternative hypothesis in ANOVA is described by unusually large differences among the group means. With this, we run the ANOVA across the genres.

1.1.2 ANOVA Table and Interpretation

ANOVA_tidy_genre <- dataset %>% 
  aov(Rating ~ Final_Genre, .) %>% 
  tidy()

ANOVA_tidy_genre
term         df        sumsq     meansq  statistic  p.value
Final_Genre  18        3.22e+03  179     198        0
Residuals    2.02e+04  1.83e+04  0.904

These results show sufficient evidence against the null hypothesis at a significance level of 0.01: the F value is high and the p-value is extremely close to 0.
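
As a check on the ANOVA output, the mean squares and the F statistic follow from the sums of squares and degrees of freedom, with k = 19 genres and N = 20,253 movies, so that N - k = 20,234 (values are rounded as reported in the output):

\[ \begin{aligned} MS_{\text{between}} &= \frac{SS_{\text{between}}}{k - 1} = \frac{3.22 \times 10^{3}}{18} \approx 179, \\ MS_{\text{within}} &= \frac{SS_{\text{within}}}{N - k} = 0.904 \ \text{(as reported)}, \\ F &= \frac{MS_{\text{between}}}{MS_{\text{within}}} \approx \frac{179}{0.904} \approx 198. \end{aligned} \]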

With a statistically significant result from ANOVA showing that mean ratings do vary across genres, we now look at conducting a post hoc test. This test allows for a pairwise comparison between genres to see which pairs differ significantly.

1.1.3 Descriptive statistics and visualisation

Here, our group seeks to visualise confidence intervals instead of relying on the p-value alone. As taught in class, confidence intervals are more robust than the p-value.

dataset_linear <- dataset %>%
  mutate(Final_Genre = fct_relevel(as_factor(Final_Genre), "Documentary"))

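Confidence_Interval_calculation <- function(myvariable, myproportion){
  # Intercept-only model: the intercept is the sample mean
  tmp = summary(lm(myvariable~1))
  my_se = tmp$coef[2]   # standard error of the mean
  my_df = tmp$df[2]     # residual degrees of freedom (n - 1)
  myP = 1 - (0.5 * (1 - myproportion))  # cumulative probability, e.g. 0.975 for 95%
  my_ci = qt (myP, my_df) * my_se   # half-width of the confidence interval
  my_ci
}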

ANOVA_initial <- dataset %>% 
  group_by(Final_Genre) %>% 
  dplyr::summarise(sample_mean = mean(Rating,na.rm = T), 
            confidence_intervals = Confidence_Interval_calculation(Rating, .95)) %>% 
  mutate(population_mean_lower = sample_mean - confidence_intervals,
         population_mean_upper = sample_mean + confidence_intervals)
## `summarise()` ungrouping output (override with `.groups` argument)
ANOVA_initial

Final_Genre  sample_mean  confidence_intervals  population_mean_lower  population_mean_upper
Action       6.06         0.0589                6                      6.12
Adventure    6.17         0.0812                6.09                   6.25
Animation    6.72         0.113                 6.61                   6.83
Biography    6.99         0.0673                6.92                   7.06
Comedy       6.15         0.0322                6.12                   6.18
Crime        6.4          0.0543                6.35                   6.45
Documentary  7.3          0.0521                7.25                   7.35
Drama        6.6          0.0233                6.58                   6.63
Family       6.23         0.11                  6.12                   6.33
Fantasy      5.99         0.0977                5.89                   6.09
History      6.88         0.103                 6.78                   6.98
Horror       5.33         0.0578                5.27                   5.39
Music        6.82         0.111                 6.71                   6.93
Mystery      6.14         0.0831                6.05                   6.22
Romance      6.36         0.0439                6.31                   6.4
SciFi        5.8          0.113                 5.69                   5.91
Sport        6.59         0.142                 6.45                   6.73
Thriller     5.95         0.0565                5.9                    6.01
War          6.95         0.125                 6.83                   7.08

Visualising the Genres against Rating

ANOVA_initial_vis <- ANOVA_initial %>% 
  ggplot(aes(x = Final_Genre, y = sample_mean)) +
  geom_point(size = 1.5) +
  geom_errorbar(aes (ymin = population_mean_lower,
                     ymax = population_mean_upper), width = 0.5, size = 0.8) +
  theme_fivethirtyeight() +
  labs(x = "Genres",
       y = "Mean Rating of movies",
       color = " ") +
  coord_cartesian(ylim = c(5,7.5)) +
  theme_linedraw () +
  theme(legend.position = "top") + 
    theme(axis.text.x = element_text(angle = 20))
ggplotly(ANOVA_initial_vis)
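
The helper Confidence_Interval_calculation obtains the standard error from an intercept-only regression. The following is a minimal sketch, on ten illustrative ratings, showing that this route gives the same half-width as the textbook formula t × s / √n.

```r
# Illustrative ratings treated as a single group
toy_rating <- c(6.7, 4.5, 6.0, 5.4, 7.3, 7.6, 3.6, 7.2, 5.6, 5.9)
n <- length(toy_rating)

# Route 1: intercept-only regression, as in the helper above
fit <- summary(lm(toy_rating ~ 1))
se_lm <- fit$coefficients[1, "Std. Error"]
half_width_lm <- qt(0.975, df = n - 1) * se_lm

# Route 2: textbook formula t * s / sqrt(n)
half_width_formula <- qt(0.975, df = n - 1) * sd(toy_rating) / sqrt(n)

c(lm = half_width_lm, formula = half_width_formula)

# 95% confidence interval for the mean rating
mean(toy_rating) + c(-1, 1) * half_width_formula
```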

1.1.4 Visualization: Notched boxplot with Mean + confidence interval

boxplot_vis <- dataset %>% 
  ggplot(aes(x = reorder(Final_Genre,Rating), y = Rating)) +
  geom_boxplot(aes(color = Final_Genre),
               notch = T) +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "errorbar",
               color = "purple",
               fun.args = list(conf.int = 0.95),
               size = 0.5) + 
  stat_summary(fun.data = "mean_cl_boot",
               geom = "pointrange",
               color = "purple",
               fun.args = list(conf.int = 0.95),
               size = 0.1) + 
  geom_jitter(aes(color = Final_Genre),
              alpha = 0.05) + 
  labs(title = "Rating Analysis of Genres from IMDB dataset",
       subtitle = "Genres",
       caption = "Source: IMDB",
       x = "Genre Names",
       y = "Rating Scores") +
theme_linedraw()+
    theme(axis.text.x = element_text(angle = 20)) + 
  theme(legend.position = "none",
        axis.title = element_text())
boxplot_vis

This visualisation shows the ratings of each genre in the IMDb dataset. It is a notched boxplot with the genres arranged in ascending order of mean rating (the default summary used by reorder), overlaid with each genre's mean and its bootstrapped 95% confidence interval.

However, we can explore these significant results further using a post hoc test, which allows pairwise comparison of genres.
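
To illustrate what a pairwise post hoc comparison returns, the following is a minimal base-R sketch on simulated ratings for three hypothetical genres (A, B and C). With k groups there are k(k - 1)/2 pairs, so 3 here and 171 for 19 genres.

```r
set.seed(1)  # arbitrary seed for a repeatable illustration

# Simulated ratings: A and B are close, C is clearly higher (not real IMDb data)
toy <- data.frame(
  genre = factor(rep(c("A", "B", "C"), each = 30)),
  rating = c(rnorm(30, 6.0, 0.5), rnorm(30, 6.1, 0.5), rnorm(30, 7.5, 0.5))
)

# One-way ANOVA followed by Tukey's honest significant differences
fit <- aov(rating ~ genre, data = toy)
tukey <- as.data.frame(TukeyHSD(fit, "genre")$genre)

# A pair differs significantly when its interval does not contain zero
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0
tukey
```

Each row is one pair, with the difference in means (diff), its family-wise 95% interval (lwr, upr) and the adjusted p-value.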

1.2.1 Post Hoc Test

Pairwise comparison between 19 genres

pairwise_significant <- dataset %>% 
  aov(Rating ~ Final_Genre,data = .) %>% 
  TukeyHSD(., which = "Final_Genre") %>% 
  tidy() %>% 
  rename(pair = contrast, 
         sample_M_diff = estimate, 
         population_M_diff_lower = conf.low, 
         population_M_diff_higher = conf.high, 
         adjusted_p = adj.p.value) %>% 
  select(-term, -null.value) %>% 
  mutate(include_zero = ifelse(population_M_diff_lower <0 & population_M_diff_higher > 0,"yes", "no")) %>% 
  filter(include_zero == "no")

as_tibble(pairwise_significant)

pair sample_M_diff population_M_diff_lower population_M_diff_higher adjusted_p include_zero
Animation-Action 0.664 0.436 0.891 0 no
Biography-Action 0.932 0.753 1.11 0 no
Crime-Action 0.343 0.212 0.474 0 no
Documentary-Action 1.24 1.1 1.39 0 no
Drama-Action 0.548 0.449 0.648 0 no
History-Action 0.824 0.602 1.05 0 no
Horror-Action -0.724 -0.862 -0.586 0 no
Music-Action 0.764 0.536 0.992 0 no
Romance-Action 0.301 0.178 0.425 0 no
SciFi-Action -0.255 -0.454 -0.056 0.00102 no
Sport-Action 0.537 0.27 0.804 0 no
War-Action 0.896 0.633 1.16 0 no
Animation-Adventure 0.553 0.311 0.796 0 no
Biography-Adventure 0.821 0.624 1.02 0 no
Crime-Adventure 0.232 0.0765 0.388 2.65e-05 no
Documentary-Adventure 1.13 0.963 1.31 0 no
Drama-Adventure 0.438 0.308 0.568 0 no
History-Adventure 0.713 0.476 0.95 0 no
Horror-Adventure -0.835 -0.997 -0.673 0 no
Music-Adventure 0.653 0.41 0.896 0 no
Romance-Adventure 0.191 0.0411 0.34 0.00114 no
SciFi-Adventure -0.366 -0.581 -0.15 4.19e-07 no
Sport-Adventure 0.426 0.146 0.706 1.47e-05 no
Thriller-Adventure -0.214 -0.367 -0.0607 0.000147 no
War-Adventure 0.785 0.509 1.06 0 no
Biography-Animation 0.268 0.00757 0.528 0.0356 no
Comedy-Animation -0.569 -0.786 -0.352 0 no
Crime-Animation -0.321 -0.551 -0.0899 0.000167 no
Documentary-Animation 0.581 0.34 0.822 0 no
Family-Animation -0.495 -0.767 -0.223 0 no
Fantasy-Animation -0.73 -0.993 -0.468 0 no
Horror-Animation -1.39 -1.62 -1.15 0 no
Mystery-Animation -0.583 -0.836 -0.33 0 no
Romance-Animation -0.362 -0.589 -0.136 3.18e-06 no
SciFi-Animation -0.919 -1.19 -0.644 0 no
Thriller-Animation -0.767 -0.996 -0.538 0 no
Comedy-Biography -0.837 -1 -0.672 0 no
Crime-Biography -0.589 -0.771 -0.406 0 no
Documentary-Biography 0.313 0.118 0.509 3.01e-06 no
Drama-Biography -0.383 -0.545 -0.222 0 no
Family-Biography -0.763 -0.996 -0.53 0 no
Fantasy-Biography -0.998 -1.22 -0.777 0 no
Horror-Biography -1.66 -1.84 -1.47 0 no
Mystery-Biography -0.851 -1.06 -0.641 0 no
Romance-Biography -0.63 -0.808 -0.453 0 no
SciFi-Biography -1.19 -1.42 -0.951 0 no
Sport-Biography -0.395 -0.691 -0.099 0.000428 no
Thriller-Biography -1.03 -1.21 -0.855 0 no
Crime-Comedy 0.248 0.136 0.36 0 no
Documentary-Comedy 1.15 1.02 1.28 0 no
Drama-Comedy 0.454 0.381 0.526 0 no
History-Comedy 0.729 0.518 0.94 0 no
Horror-Comedy -0.819 -0.939 -0.699 0 no
Music-Comedy 0.669 0.451 0.887 0 no
Romance-Comedy 0.207 0.103 0.31 0 no
SciFi-Comedy -0.35 -0.536 -0.163 0 no
Sport-Comedy 0.442 0.184 0.701 2.74e-07 no
Thriller-Comedy -0.198 -0.306 -0.0897 0 no
War-Comedy 0.801 0.547 1.05 0 no
Documentary-Crime 0.902 0.748 1.06 0 no
Drama-Crime 0.205 0.0984 0.312 0 no
Fantasy-Crime -0.41 -0.596 -0.224 0 no
History-Crime 0.481 0.255 0.706 0 no
Horror-Crime -1.07 -1.21 -0.924 0 no
Music-Crime 0.421 0.189 0.652 0 no
Mystery-Crime -0.262 -0.434 -0.0908 1.26e-05 no
SciFi-Crime -0.598 -0.801 -0.395 0 no
Thriller-Crime -0.446 -0.58 -0.312 0 no
War-Crime 0.553 0.287 0.818 0 no
Drama-Documentary -0.696 -0.824 -0.569 0 no
Family-Documentary -1.08 -1.29 -0.865 0 no
Fantasy-Documentary -1.31 -1.51 -1.11 0 no
History-Documentary -0.421 -0.657 -0.185 3.17e-08 no
Horror-Documentary -1.97 -2.13 -1.81 0 no
Music-Documentary -0.481 -0.723 -0.239 0 no
Mystery-Documentary -1.16 -1.35 -0.979 0 no
Romance-Documentary -0.943 -1.09 -0.796 0 no
SciFi-Documentary -1.5 -1.71 -1.29 0 no
Sport-Documentary -0.708 -0.987 -0.429 0 no
Thriller-Documentary -1.35 -1.5 -1.2 0 no
War-Documentary -0.349 -0.624 -0.0746 0.0012 no
Family-Drama -0.38 -0.559 -0.2 0 no
Fantasy-Drama -0.615 -0.78 -0.45 0 no
History-Drama 0.275 0.0668 0.484 0.000544 no
Horror-Drama -1.27 -1.39 -1.16 0 no
Music-Drama 0.215 0.000365 0.431 0.049 no
Mystery-Drama -0.468 -0.616 -0.319 0 no
Romance-Drama -0.247 -0.345 -0.149 0 no
SciFi-Drama -0.803 -0.987 -0.62 0 no
Thriller-Drama -0.651 -0.754 -0.549 0 no
War-Drama 0.347 0.0959 0.599 0.000192 no
Fantasy-Family -0.235 -0.471 -5.2e-05 0.0499 no
History-Family 0.655 0.387 0.923 0 no
Horror-Family -0.893 -1.1 -0.69 0 no
Music-Family 0.595 0.322 0.868 0 no
SciFi-Family -0.424 -0.673 -0.175 3.35e-07 no
Sport-Family 0.368 0.0618 0.674 0.0035 no
Thriller-Family -0.272 -0.469 -0.0751 0.000191 no
War-Family 0.727 0.425 1.03 0 no
History-Fantasy 0.89 0.632 1.15 0 no
Horror-Fantasy -0.658 -0.848 -0.467 0 no
Music-Fantasy 0.831 0.567 1.09 0 no
Romance-Fantasy 0.368 0.187 0.549 0 no
Sport-Fantasy 0.604 0.306 0.901 0 no
War-Fantasy 0.962 0.668 1.26 0 no
Horror-History -1.55 -1.78 -1.32 0 no
Mystery-History -0.743 -0.991 -0.495 0 no
Romance-History -0.522 -0.743 -0.301 0 no
SciFi-History -1.08 -1.35 -0.808 0 no
Thriller-History -0.927 -1.15 -0.703 0 no
Music-Horror 1.49 1.25 1.72 0 no
Mystery-Horror 0.805 0.628 0.982 0 no
Romance-Horror 1.03 0.889 1.16 0 no
SciFi-Horror 0.469 0.262 0.677 0 no
Sport-Horror 1.26 0.988 1.53 0 no
Thriller-Horror 0.621 0.481 0.762 0 no
War-Horror 1.62 1.35 1.89 0 no
Mystery-Music -0.683 -0.937 -0.43 0 no
Romance-Music -0.462 -0.69 -0.235 0 no
SciFi-Music -1.02 -1.29 -0.743 0 no
Thriller-Music -0.867 -1.1 -0.637 0 no
Romance-Mystery 0.221 0.0549 0.387 0.000456 no
SciFi-Mystery -0.336 -0.563 -0.108 3.55e-05 no
Sport-Mystery 0.456 0.167 0.745 4.83e-06 no
Thriller-Mystery -0.184 -0.353 -0.0147 0.0172 no
War-Mystery 0.815 0.53 1.1 0 no
SciFi-Romance -0.556 -0.754 -0.358 0 no
Thriller-Romance -0.404 -0.531 -0.278 0 no
War-Romance 0.594 0.332 0.856 0 no
Sport-SciFi 0.792 0.483 1.1 0 no
War-SciFi 1.15 0.846 1.46 0 no
Thriller-Sport -0.64 -0.908 -0.372 0 no
War-Sport 0.359 0.0056 0.712 0.0416 no
War-Thriller 0.999 0.735 1.26 0 no

Of the original 19C2 = 171 pairwise comparisons, 134 are significant. We now evaluate which genres differ significantly from the largest number of other genres.
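The pair count can be verified with a short calculation. The variable names below are illustrative only; the figures 19 and 134 come from the analysis above.

```r
n_genres    <- 19
total_pairs <- choose(n_genres, 2)   # 19 * 18 / 2 unordered pairs
significant <- 134                   # rows left after filtering on include_zero

total_pairs                          # 171
total_pairs - significant            # 37 pairs whose interval includes zero
round(significant / total_pairs, 3)  # share of pairs that differ significantly
```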

1.2.2 Genre significance comparison

pair_significant <- separate(pairwise_significant,pair, c("first", "second"), sep = "-",remove = F)
pair_significant <- pair_significant[2:3]
glimpse(pair_significant)
## Rows: 134
## Columns: 2
## $ first  <chr> "Animation", "Biography", "Crime", "Documentary", "Drama", "Hi…
## $ second <chr> "Action", "Action", "Action", "Action", "Action", "Action", "A…
sapply(pair_significant, function(x) length(unique(x)))
##  first second 
##     17     18
pairwise_genre <- data.frame(genre = c(pair_significant$first, pair_significant$second))
pairwise_genre %>% 
  group_by(genre) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count))
## `summarise()` ungrouping output (override with `.groups` argument)
genre count
Documentary 18
Horror 18
Drama 16
SciFi 16
Biography 15
Crime 15
Romance 15
Thriller 15
War 14
Adventure 13
Animation 13
Comedy 13
History 13
Music 13
Mystery 13
Action 12
Family 12
Fantasy 12
Sport 12

We now see that each of the 19 genres differs significantly from at least 12 other genres, and that Documentary and Horror differ from all 18. With a clearer idea of our data, we use Documentary as the control (reference) genre for the next few models.

There are two reasons for choosing Documentary. First, its mean rating differs significantly from that of every other genre. Second, it has the highest mean and the highest median rating of all genres. This makes it a natural baseline for the simple linear regression later.
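The counting step can also be sketched in base R. The pair labels below are hypothetical. Note that splitting on "-" only works because no genre label contains a hyphen itself, which is why the genre appears as "SciFi" in the pairs.

```r
# Hypothetical significant pairs, in the same "first-second" format as above
pairs <- c("B-A", "C-A", "C-B", "D-A")

# Split every pair into its two genres and count how often each appears
genres <- unlist(strsplit(pairs, "-", fixed = TRUE))
counts <- sort(table(genres), decreasing = TRUE)
counts
```

With 19 genres, a count of 18 means the genre differs significantly from every other genre.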

1.3 ANOVA assumption check

As we learnt in the final week, ANOVA has 3 checks that need to be met:

  1. Independence of genres

  2. Rating data is normal in each genre

  3. Variability of rating across genres is about equal

To check the third assumption, we use Levene's test. The null hypothesis is that the variance of ratings is the same across all genres. The alternative hypothesis is that at least one genre has a different variance.

dataset %>% leveneTest(Rating ~ as.factor(Final_Genre), data = .)

Df     F value  Pr(>F)
18     28.6     1.12e-96
20234

Levene's test shows significant evidence against the null hypothesis at the 1% significance level, so at least one genre has a different variance. Even though the movies span a 30-year period, each movie has at least 1,000 votes and each genre has at least 100 movies, the variance of ratings still differs across genres. The equal-variance assumption is therefore not met.
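For intuition, a Levene-type test can be written by hand as a one-way ANOVA on the absolute deviations from each group's centre. The sketch below uses simulated data and centres on the group median, which, as far as we are aware, is also the default in car::leveneTest(). It then runs Welch's one-way ANOVA (oneway.test() in base R), one commonly used alternative when variances are unequal. This is an illustration, not part of our analysis.

```r
# Simulated ratings: equal means, deliberately unequal spreads
set.seed(42)
toy <- data.frame(
  genre  = factor(rep(c("A", "B", "C"), each = 200)),
  rating = c(rnorm(200, 6, 0.5), rnorm(200, 6, 1.0), rnorm(200, 6, 1.5))
)

# Levene-type test: ANOVA on absolute deviations from each group's median
toy$abs_dev    <- abs(toy$rating - ave(toy$rating, toy$genre, FUN = median))
levene_by_hand <- anova(lm(abs_dev ~ genre, data = toy))
levene_p       <- levene_by_hand[["Pr(>F)"]][1]
levene_p

# Welch's one-way ANOVA does not assume equal variances across groups
welch <- oneway.test(rating ~ genre, data = toy, var.equal = FALSE)
welch$p.value
```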

Model 2: Correlation analysis

Model 2 consists of correlation analysis between Ratings, Genres, Movie Duration and Number of votes on the IMDb webpage. Correlation is a statistical method for evaluating the strength of the relationship between two quantitative variables.

In this case, we want to look at the relationship between Ratings and each of the other variables. A high correlation means that two variables have a strong linear relationship with each other, while a weak correlation means that the variables are hardly related. We treat this as a preliminary step before running regression analysis, which will be shown in the rest of our models.

2.1 Creating new dichotomous variable for Genres

dataset2 <- dataset %>% 
  mutate(Comedy = ifelse(Final_Genre == "Comedy",1,0)) %>% 
  mutate(Mystery = ifelse(Final_Genre == "Mystery",1,0)) %>% 
  mutate(Horror = ifelse(Final_Genre == "Horror",1,0)) %>% 
  mutate(Action = ifelse(Final_Genre == "Action",1,0)) %>% 
  mutate(History = ifelse(Final_Genre == "History",1,0)) %>% 
  mutate(`Sci-Fi` = ifelse(Final_Genre == "Sci-Fi",1,0)) %>% 
  mutate(Drama = ifelse(Final_Genre == "Drama",1,0)) %>% 
  mutate(Thriller = ifelse(Final_Genre == "Thriller",1,0)) %>% 
  mutate(Family = ifelse(Final_Genre == "Family",1,0)) %>% 
  mutate(Romance = ifelse(Final_Genre == "Romance",1,0)) %>% 
  mutate(Biography = ifelse(Final_Genre == "Biography",1,0)) %>% 
  mutate(Adventure = ifelse(Final_Genre == "Adventure",1,0)) %>% 
  mutate(Music = ifelse(Final_Genre == "Music",1,0)) %>% 
  mutate(Crime = ifelse(Final_Genre == "Crime",1,0)) %>% 
  mutate(War = ifelse(Final_Genre == "War",1,0)) %>% 
  mutate(Fantasy = ifelse(Final_Genre == "Fantasy",1,0)) %>% 
  mutate(Animation = ifelse(Final_Genre == "Animation",1,0)) %>% 
  mutate(Documentary = ifelse(Final_Genre == "Documentary",1,0)) %>% 
  mutate(Sport = ifelse(Final_Genre == "Sport",1,0))
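As an aside, the same 0/1 columns can be generated in one step from the factor levels themselves, which avoids typing each genre label by hand. This is a sketch on a toy vector of labels, not a replacement we have run on the full dataset.

```r
# Toy genre labels (hypothetical rows, not the IMDb data)
genre <- factor(c("Comedy", "Drama", "SciFi", "Drama"))

# One indicator column per level; "~ 0 +" drops the intercept so that
# every level gets its own column
dummies <- model.matrix(~ 0 + genre)
colnames(dummies) <- levels(genre)
dummies
```

Because the column names come from levels(genre), each dummy is guaranteed to match a label that actually occurs in the data.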

2.2 Running correlation matrix

Point Biserial correlation: Target genre vs the rest

correlation_matrix <- dataset2 %>% 
  select(Comedy, Mystery, Horror, Action, History, `Sci-Fi`, Drama, Thriller, Family, Romance,
         Biography, Adventure, Music, Crime, War, Fantasy, Animation, Documentary, Sport, Duration, Num_of_Votes, Rating) %>% 
  as.matrix(.) %>% 
  rcorr(., type = "pearson") %>% #"Pearson" as categorical data not ranked, can't use "spearman" 
  tidy() %>%  #store your correlational matrix into tbl_df
  print(n = nrow(.)) %>% 
  filter(column1 == "Rating") %>% 
  filter(p.value <= 0.05) #Only showing correlation with Rating
## # A tibble: 231 x 5
##     column1      column2       estimate     n    p.value
##     <chr>        <chr>            <dbl> <int>      <dbl>
##   1 Mystery      Comedy        -0.0770  20253   0.      
##   2 Horror       Comedy        -0.104   20253   0.      
##   3 Horror       Mystery       -0.0383  20253   4.81e- 8
##   4 Action       Comedy        -0.126   20253   0.      
##   5 Action       Mystery       -0.0463  20253   4.23e-11
##   6 Action       Horror        -0.0627  20253   0.      
##   7 History      Comedy        -0.0532  20253   3.66e-14
##   8 History      Mystery       -0.0196  20253   5.36e- 3
##   9 History      Horror        -0.0265  20253   1.64e- 4
##  10 History      Action        -0.0320  20253   5.27e- 6
##  11 Sci-Fi       Comedy       NaN       20253 NaN       
##  12 Sci-Fi       Mystery      NaN       20253 NaN       
##  13 Sci-Fi       Horror       NaN       20253 NaN       
##  14 Sci-Fi       Action       NaN       20253 NaN       
##  15 Sci-Fi       History      NaN       20253 NaN       
##  16 Drama        Comedy        -0.278   20253   0.      
##  17 Drama        Mystery       -0.102   20253   0.      
##  18 Drama        Horror        -0.138   20253   0.      
##  19 Drama        Action        -0.167   20253   0.      
##  20 Drama        History       -0.0706  20253   0.      
##  21 Drama        Sci-Fi       NaN       20253 NaN       
##  22 Thriller     Comedy        -0.121   20253   0.      
##  23 Thriller     Mystery       -0.0444  20253   2.61e-10
##  24 Thriller     Horror        -0.0601  20253   0.      
##  25 Thriller     Action        -0.0726  20253   0.      
##  26 Thriller     History       -0.0307  20253   1.28e- 5
##  27 Thriller     Sci-Fi       NaN       20253 NaN       
##  28 Thriller     Drama         -0.160   20253   0.      
##  29 Family       Comedy        -0.0624  20253   0.      
##  30 Family       Mystery       -0.0230  20253   1.08e- 3
##  31 Family       Horror        -0.0311  20253   9.74e- 6
##  32 Family       Action        -0.0375  20253   9.07e- 8
##  33 Family       History       -0.0159  20253   2.40e- 2
##  34 Family       Sci-Fi       NaN       20253 NaN       
##  35 Family       Drama         -0.0829  20253   0.      
##  36 Family       Thriller      -0.0360  20253   3.04e- 7
##  37 Romance      Comedy        -0.129   20253   0.      
##  38 Romance      Mystery       -0.0475  20253   1.32e-11
##  39 Romance      Horror        -0.0643  20253   0.      
##  40 Romance      Action        -0.0777  20253   0.      
##  41 Romance      History       -0.0328  20253   3.00e- 6
##  42 Romance      Sci-Fi       NaN       20253 NaN       
##  43 Romance      Drama         -0.171   20253   0.      
##  44 Romance      Thriller      -0.0745  20253   0.      
##  45 Romance      Family        -0.0385  20253   4.20e- 8
##  46 Biography    Comedy        -0.0703  20253   0.      
##  47 Biography    Mystery       -0.0259  20253   2.33e- 4
##  48 Biography    Horror        -0.0350  20253   6.31e- 7
##  49 Biography    Action        -0.0423  20253   1.75e- 9
##  50 Biography    History       -0.0179  20253   1.10e- 2
##  51 Biography    Sci-Fi       NaN       20253 NaN       
##  52 Biography    Drama         -0.0933  20253   0.      
##  53 Biography    Thriller      -0.0405  20253   8.02e- 9
##  54 Biography    Family        -0.0210  20253   2.86e- 3
##  55 Biography    Romance       -0.0434  20253   6.62e-10
##  56 Adventure    Comedy        -0.0897  20253   0.      
##  57 Adventure    Mystery       -0.0330  20253   2.62e- 6
##  58 Adventure    Horror        -0.0447  20253   2.02e-10
##  59 Adventure    Action        -0.0540  20253   1.51e-14
##  60 Adventure    History       -0.0228  20253   1.18e- 3
##  61 Adventure    Sci-Fi       NaN       20253 NaN       
##  62 Adventure    Drama         -0.119   20253   0.      
##  63 Adventure    Thriller      -0.0517  20253   1.78e-13
##  64 Adventure    Family        -0.0268  20253   1.40e- 4
##  65 Adventure    Romance       -0.0554  20253   3.11e-15
##  66 Adventure    Biography     -0.0301  20253   1.80e- 5
##  67 Music        Comedy        -0.0515  20253   2.36e-13
##  68 Music        Mystery       -0.0189  20253   7.05e- 3
##  69 Music        Horror        -0.0256  20253   2.66e- 4
##  70 Music        Action        -0.0310  20253   1.05e- 5
##  71 Music        History       -0.0131  20253   6.28e- 2
##  72 Music        Sci-Fi       NaN       20253 NaN       
##  73 Music        Drama         -0.0683  20253   0.      
##  74 Music        Thriller      -0.0297  20253   2.42e- 5
##  75 Music        Family        -0.0153  20253   2.90e- 2
##  76 Music        Romance       -0.0318  20253   6.20e- 6
##  77 Music        Biography     -0.0173  20253   1.39e- 2
##  78 Music        Adventure     -0.0221  20253   1.70e- 3
##  79 Crime        Comedy        -0.114   20253   0.      
##  80 Crime        Mystery       -0.0421  20253   2.10e- 9
##  81 Crime        Horror        -0.0569  20253   4.44e-16
##  82 Crime        Action        -0.0688  20253   0.      
##  83 Crime        History       -0.0291  20253   3.54e- 5
##  84 Crime        Sci-Fi       NaN       20253 NaN       
##  85 Crime        Drama         -0.152   20253   0.      
##  86 Crime        Thriller      -0.0659  20253   0.      
##  87 Crime        Family        -0.0341  20253   1.21e- 6
##  88 Crime        Romance       -0.0706  20253   0.      
##  89 Crime        Biography     -0.0384  20253   4.58e- 8
##  90 Crime        Adventure     -0.0490  20253   2.96e-12
##  91 Crime        Music         -0.0281  20253   6.29e- 5
##  92 War          Comedy        -0.0437  20253   4.99e-10
##  93 War          Mystery       -0.0161  20253   2.22e- 2
##  94 War          Horror        -0.0218  20253   1.96e- 3
##  95 War          Action        -0.0263  20253   1.84e- 4
##  96 War          History       -0.0111  20253   1.14e- 1
##  97 War          Sci-Fi       NaN       20253 NaN       
##  98 War          Drama         -0.0580  20253   0.      
##  99 War          Thriller      -0.0252  20253   3.38e- 4
## 100 War          Family        -0.0130  20253   6.38e- 2
## 101 War          Romance       -0.0270  20253   1.25e- 4
## 102 War          Biography     -0.0147  20253   3.68e- 2
## 103 War          Adventure     -0.0187  20253   7.70e- 3
## 104 War          Music         -0.0107  20253   1.26e- 1
## 105 War          Crime         -0.0239  20253   6.81e- 4
## 106 Fantasy      Comedy        -0.0685  20253   0.      
## 107 Fantasy      Mystery       -0.0252  20253   3.35e- 4
## 108 Fantasy      Horror        -0.0341  20253   1.21e- 6
## 109 Fantasy      Action        -0.0412  20253   4.47e- 9
## 110 Fantasy      History       -0.0174  20253   1.33e- 2
## 111 Fantasy      Sci-Fi       NaN       20253 NaN       
## 112 Fantasy      Drama         -0.0909  20253   0.      
## 113 Fantasy      Thriller      -0.0395  20253   1.90e- 8
## 114 Fantasy      Family        -0.0204  20253   3.65e- 3
## 115 Fantasy      Romance       -0.0423  20253   1.77e- 9
## 116 Fantasy      Biography     -0.0230  20253   1.06e- 3
## 117 Fantasy      Adventure     -0.0294  20253   2.93e- 5
## 118 Fantasy      Music         -0.0168  20253   1.66e- 2
## 119 Fantasy      Crime         -0.0374  20253   9.97e- 8
## 120 Fantasy      War           -0.0143  20253   4.19e- 2
## 121 Animation    Comedy        -0.0517  20253   1.89e-13
## 122 Animation    Mystery       -0.0190  20253   6.83e- 3
## 123 Animation    Horror        -0.0257  20253   2.51e- 4
## 124 Animation    Action        -0.0311  20253   9.71e- 6
## 125 Animation    History       -0.0131  20253   6.18e- 2
## 126 Animation    Sci-Fi       NaN       20253 NaN       
## 127 Animation    Drama         -0.0686  20253   0.      
## 128 Animation    Thriller      -0.0298  20253   2.25e- 5
## 129 Animation    Family        -0.0154  20253   2.84e- 2
## 130 Animation    Romance       -0.0319  20253   5.69e- 6
## 131 Animation    Biography     -0.0173  20253   1.36e- 2
## 132 Animation    Adventure     -0.0221  20253   1.62e- 3
## 133 Animation    Music         -0.0127  20253   7.07e- 2
## 134 Animation    Crime         -0.0282  20253   5.88e- 5
## 135 Animation    War           -0.0108  20253   1.25e- 1
## 136 Animation    Fantasy       -0.0169  20253   1.61e- 2
## 137 Documentary  Comedy        -0.0917  20253   0.      
## 138 Documentary  Mystery       -0.0338  20253   1.55e- 6
## 139 Documentary  Horror        -0.0457  20253   7.83e-11
## 140 Documentary  Action        -0.0552  20253   4.00e-15
## 141 Documentary  History       -0.0233  20253   9.08e- 4
## 142 Documentary  Sci-Fi       NaN       20253 NaN       
## 143 Documentary  Drama         -0.122   20253   0.      
## 144 Documentary  Thriller      -0.0529  20253   5.02e-14
## 145 Documentary  Family        -0.0274  20253   9.87e- 5
## 146 Documentary  Romance       -0.0566  20253   8.88e-16
## 147 Documentary  Biography     -0.0308  20253   1.16e- 5
## 148 Documentary  Adventure     -0.0393  20253   2.16e- 8
## 149 Documentary  Music         -0.0226  20253   1.33e- 3
## 150 Documentary  Crime         -0.0501  20253   9.49e-13
## 151 Documentary  War           -0.0191  20253   6.43e- 3
## 152 Documentary  Fantasy       -0.0300  20253   1.93e- 5
## 153 Documentary  Animation     -0.0226  20253   1.27e- 3
## 154 Sport        Comedy        -0.0428  20253   1.07e- 9
## 155 Sport        Mystery       -0.0158  20253   2.49e- 2
## 156 Sport        Horror        -0.0213  20253   2.40e- 3
## 157 Sport        Action        -0.0258  20253   2.45e- 4
## 158 Sport        History       -0.0109  20253   1.21e- 1
## 159 Sport        Sci-Fi       NaN       20253 NaN       
## 160 Sport        Drama         -0.0569  20253   4.44e-16
## 161 Sport        Thriller      -0.0247  20253   4.41e- 4
## 162 Sport        Family        -0.0128  20253   6.91e- 2
## 163 Sport        Romance       -0.0264  20253   1.69e- 4
## 164 Sport        Biography     -0.0144  20253   4.07e- 2
## 165 Sport        Adventure     -0.0184  20253   8.97e- 3
## 166 Sport        Music         -0.0105  20253   1.34e- 1
## 167 Sport        Crime         -0.0234  20253   8.66e- 4
## 168 Sport        War           -0.00894 20253   2.03e- 1
## 169 Sport        Fantasy       -0.0140  20253   4.61e- 2
## 170 Sport        Animation     -0.0106  20253   1.32e- 1
## 171 Sport        Documentary   -0.0188  20253   7.54e- 3
## 172 Duration     Comedy        -0.0858  20253   0.      
## 173 Duration     Mystery       -0.00662 20253   3.46e- 1
## 174 Duration     Horror        -0.105   20253   0.      
## 175 Duration     Action         0.0947  20253   0.      
## 176 Duration     History        0.0733  20253   0.      
## 177 Duration     Sci-Fi       NaN       20253 NaN       
## 178 Duration     Drama          0.0910  20253   0.      
## 179 Duration     Thriller      -0.0236  20253   8.01e- 4
## 180 Duration     Family        -0.0416  20253   3.12e- 9
## 181 Duration     Romance        0.0368  20253   1.66e- 7
## 182 Duration     Biography      0.0627  20253   0.      
## 183 Duration     Adventure     -0.00781 20253   2.66e- 1
## 184 Duration     Music          0.00155 20253   8.26e- 1
## 185 Duration     Crime          0.0259  20253   2.30e- 4
## 186 Duration     War            0.0464  20253   3.86e-11
## 187 Duration     Fantasy       -0.0130  20253   6.47e- 2
## 188 Duration     Animation     -0.0860  20253   0.      
## 189 Duration     Documentary   -0.102   20253   0.      
## 190 Duration     Sport          0.00712 20253   3.11e- 1
## 191 Num_of_Votes Comedy        -0.0356  20253   3.90e- 7
## 192 Num_of_Votes Mystery        0.0116  20253   9.77e- 2
## 193 Num_of_Votes Horror        -0.0231  20253   1.03e- 3
## 194 Num_of_Votes Action         0.0820  20253   0.      
## 195 Num_of_Votes History       -0.00583 20253   4.07e- 1
## 196 Num_of_Votes Sci-Fi       NaN       20253 NaN       
## 197 Num_of_Votes Drama         -0.0450  20253   1.55e-10
## 198 Num_of_Votes Thriller       0.00746 20253   2.88e- 1
## 199 Num_of_Votes Family        -0.00404 20253   5.65e- 1
## 200 Num_of_Votes Romance       -0.0311  20253   9.30e- 6
## 201 Num_of_Votes Biography      0.0139  20253   4.75e- 2
## 202 Num_of_Votes Adventure      0.0893  20253   0.      
## 203 Num_of_Votes Music         -0.0179  20253   1.09e- 2
## 204 Num_of_Votes Crime          0.00673 20253   3.38e- 1
## 205 Num_of_Votes War           -0.00350 20253   6.18e- 1
## 206 Num_of_Votes Fantasy        0.0324  20253   3.97e- 6
## 207 Num_of_Votes Animation      0.0257  20253   2.55e- 4
## 208 Num_of_Votes Documentary   -0.0531  20253   3.84e-14
## 209 Num_of_Votes Sport         -0.00825 20253   2.40e- 1
## 210 Num_of_Votes Duration       0.161   20253   0.      
## 211 Rating       Comedy        -0.0817  20253   0.      
## 212 Rating       Mystery       -0.0324  20253   4.03e- 6
## 213 Rating       Horror        -0.222   20253   0.      
## 214 Rating       Action        -0.0745  20253   0.      
## 215 Rating       History        0.0614  20253   0.      
## 216 Rating       Sci-Fi       NaN       20253 NaN       
## 217 Rating       Drama          0.159   20253   0.      
## 218 Rating       Thriller      -0.0977  20253   0.      
## 219 Rating       Family        -0.0146  20253   3.81e- 2
## 220 Rating       Romance        0.00617 20253   3.80e- 1
## 221 Rating       Biography      0.0973  20253   0.      
## 222 Rating       Adventure     -0.0320  20253   5.18e- 6
## 223 Rating       Music          0.0529  20253   4.84e-14
## 224 Rating       Crime          0.0155  20253   2.69e- 2
## 225 Rating       War            0.0571  20253   4.44e-16
## 226 Rating       Fantasy       -0.0502  20253   8.82e-13
## 227 Rating       Animation      0.0422  20253   1.95e- 9
## 228 Rating       Documentary    0.188   20253   0.      
## 229 Rating       Sport          0.0234  20253   8.52e- 4
## 230 Rating       Duration       0.275   20253   0.      
## 231 Rating       Num_of_Votes   0.217   20253   0.

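From the correlation table, 17 of the 19 genre dummies have a significant correlation with Rating at the 5% level. The two exceptions are Romance (r = 0.006, p = 0.38) and Sci-Fi. The Sci-Fi estimates are NaN because the dummy in 2.1 matches on "Sci-Fi" while the genre is stored as "SciFi" in Final_Genre, so that column is all zeros; matching on "SciFi" would populate it. The genre correlations are small in magnitude, the largest being Horror (r = -0.222) and Documentary (r = 0.188). Our group also looked at Rating against the numerical variables Duration (r = 0.275) and number of votes (r = 0.217). Both correlations are positive and significant. Aside from Genre, Duration will be the second explanatory variable we analyse later on.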
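A point-biserial correlation is simply the Pearson correlation between a 0/1 dummy and a continuous variable, and its test is equivalent to a pooled two-sample t-test. The sketch below checks this on simulated data (the variable names are hypothetical) and also shows that a dummy with no variation has an undefined correlation.

```r
# Simulated data: 50 "documentaries" (1) and 150 other movies (0)
set.seed(7)
is_doc <- rep(c(0, 1), times = c(150, 50))
rating <- rnorm(200, mean = 6 + 1 * is_doc, sd = 1)

# Point-biserial correlation = Pearson correlation with the 0/1 dummy
r  <- cor(is_doc, rating)
ct <- cor.test(is_doc, rating)

# The same t statistic comes from a pooled two-sample t-test
tt <- t.test(rating[is_doc == 1], rating[is_doc == 0], var.equal = TRUE)
c(correlation_t = unname(ct$statistic), t_test_t = unname(tt$statistic))

# A dummy that is all zeros has zero variance, so its correlation is undefined
all_zero <- rep(0, 200)
suppressWarnings(cor(all_zero, rating))
```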

2.3 Correlation Heatmap

correlation_matrix_2  <- correlation_matrix[1:3]

heatmap <- correlation_matrix_2 %>% 
  ggplot(mapping = aes(y = column1, x = column2)) + 
  geom_tile_interactive(aes(fill = estimate, tooltip = round(estimate,4), data_id = column2)) + 
  scale_fill_distiller(palette = "YlGnBu") + 
  labs(x= "Genre",
       y = "Rating",
       title = "Correlation Heat Map",
       subtitle = "Genres vs Rating") +
  theme_linedraw() + 
  theme(legend.position="top") +
  theme(legend.title = element_text(color = "blue", size = 14),
        legend.text = element_text(size = 10)) +
  theme(aspect.ratio=1/15) + 
  theme(axis.text.x = element_text(angle = 20)) + 
  theme(axis.text = element_text(size = 6))

girafe(ggobj = heatmap)

Model 3: Simple Linear Regression analysis

Model 3 consists of simple linear regression, which is a statistical method that allows us to summarise and study the relationship between Ratings and Genre. Since Final_Genre is categorical, we convert it into a factor so that lm() can dummy-code it. Each coefficient then measures the difference in mean rating between one genre and the reference genre.

In order to ensure Time Lag in our analysis, we use the “Post-Test” only design, where we set the Documentary genre as the control so that it serves as a pre-test measure. We selected Documentary because our previous ANOVA analysis showed that its mean differs significantly from that of every other genre and that it has the highest average rating.

3.1 Setting Documentary as control

dataset3 <- dataset %>% #make Documentary the reference (control) level of Final_Genre
  mutate(Final_Genre = fct_relevel(as_factor(Final_Genre), "Documentary"))

3.2 Constructing Linear Regression

lm_regression <- lm(Rating ~ Final_Genre, data = dataset3)
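To show what the reference level does, here is a minimal sketch on simulated data with three hypothetical genres. With a single factor predictor, the intercept equals the mean of the reference group and each coefficient equals that genre's mean minus the reference mean.

```r
# Simulated ratings (not the IMDb data)
set.seed(3)
toy <- data.frame(
  genre  = factor(rep(c("Action", "Documentary", "Drama"), each = 40)),
  rating = c(rnorm(40, 6.0), rnorm(40, 7.3), rnorm(40, 6.6))
)

# Make Documentary the reference level, then fit the regression
toy$genre <- relevel(toy$genre, ref = "Documentary")
fit <- lm(rating ~ genre, data = toy)

# Compare the coefficients with the raw group means
group_means <- tapply(toy$rating, toy$genre, mean)
coef(fit)
group_means - group_means[["Documentary"]]
```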

3.3 Looking at variable and model-level information

lm_tidyverse_variables <- lm_regression %>% tidy() #variable-level information
lm_tidyverse_model <- lm_regression %>% glance() #model-level information

combined_model <- export_summs(lm_regression, model.names = "Regression of Genre against Ratings",
                  error_format = "p = {p.value}",
                  digits = 3)
combined_model
                      Regression of Genre against Ratings
(Intercept)           7.301 ***
                      p = 0.000
Final_GenreDrama      -0.696 ***
                      p = 0.000
Final_GenreMystery    -1.164 ***
                      p = 0.000
Final_GenreHorror     -1.969 ***
                      p = 0.000
Final_GenreAction     -1.245 ***
                      p = 0.000
Final_GenreComedy     -1.150 ***
                      p = 0.000
Final_GenreCrime      -0.902 ***
                      p = 0.000
Final_GenreRomance    -0.943 ***
                      p = 0.000
Final_GenreSciFi      -1.500 ***
                      p = 0.000
Final_GenreThriller   -1.348 ***
                      p = 0.000
Final_GenreFamily     -1.076 ***
                      p = 0.000
Final_GenreAdventure  -1.134 ***
                      p = 0.000
Final_GenreBiography  -0.313 ***
                      p = 0.000
Final_GenreMusic      -0.481 ***
                      p = 0.000
Final_GenreWar        -0.349 ***
                      p = 0.000
Final_GenreAnimation  -0.581 ***
                      p = 0.000
Final_GenreFantasy    -1.312 ***
                      p = 0.000
Final_GenreHistory    -0.421 ***
                      p = 0.000
Final_GenreSport      -0.708 ***
                      p = 0.000
N                     20253
R2                    0.150
*** p < 0.001; ** p < 0.01; * p < 0.05.

3.4 Presenting results using Kable

kable_initial_regression_result <- kable(tidy(lm_regression)) %>% 
  kable_paper("striped", full_width = F) %>% 
  column_spec(c(1,5), bold = T) %>% 
  row_spec(c(2,4,6,8,10,12,14,16,18), bold = T,
           color = "white", background = "blue")
kable_initial_regression_result
term estimate std.error statistic p.value
(Intercept) 7.3011494 0.0339818 214.854473 0.0000000
Final_GenreDrama -0.6964591 0.0363377 -19.166307 0.0000000
Final_GenreMystery -1.1642319 0.0526798 -22.100157 0.0000000
Final_GenreHorror -1.9692175 0.0453856 -43.388610 0.0000000
Final_GenreAction -1.2449082 0.0422946 -29.434210 0.0000000
Final_GenreComedy -1.1501081 0.0375864 -30.599084 0.0000000
Final_GenreCrime -0.9018211 0.0437486 -20.613696 0.0000000
Final_GenreRomance -0.9434237 0.0419472 -22.490736 0.0000000
Final_GenreSciFi -1.4997330 0.0609605 -24.601710 0.0000000
Final_GenreThriller -1.3479224 0.0429105 -31.412412 0.0000000
Final_GenreFamily -1.0760143 0.0599875 -17.937314 0.0000000
Final_GenreAdventure -1.1342161 0.0485833 -23.345787 0.0000000
Final_GenreBiography -0.3131409 0.0555960 -5.632437 0.0000000
Final_GenreMusic -0.4809913 0.0687648 -6.994727 0.0000000
Final_GenreWar -0.3492369 0.0780746 -4.473117 0.0000078
Final_GenreAnimation -0.5811494 0.0685607 -8.476418 0.0000000
Final_GenreFantasy -1.3115098 0.0564908 -23.216348 0.0000000
Final_GenreHistory -0.4211494 0.0671087 -6.275626 0.0000000
Final_GenreSport -0.7079676 0.0793231 -8.925110 0.0000000

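From the regression table, our group identified an R^2 value of 0.15, with every estimate significant based on its p-value. This implies that about 15% of the variance in Ratings can be explained by genre. Each estimate is the difference between that genre's mean rating and the Documentary mean of 7.30, so every other genre is rated lower than Documentary on average.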
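For a regression on a single categorical predictor, R^2 is the between-group sum of squares divided by the total sum of squares. The following sketch verifies this on simulated data with three hypothetical genres.

```r
# Simulated ratings (not the IMDb data)
set.seed(11)
toy <- data.frame(
  genre  = factor(rep(c("A", "B", "C"), each = 60)),
  rating = c(rnorm(60, 6.0), rnorm(60, 6.5), rnorm(60, 7.0))
)
fit <- lm(rating ~ genre, data = toy)

# Share of total variation in rating that sits between the genre means
grand_mean <- mean(toy$rating)
group_mean <- ave(toy$rating, toy$genre)   # each row gets its genre's mean
ss_between <- sum((group_mean - grand_mean)^2)
ss_total   <- sum((toy$rating - grand_mean)^2)
r2_by_hand <- ss_between / ss_total

c(by_hand = r2_by_hand, from_lm = summary(fit)$r.squared)
```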

3.5 Mean Rating and confidence intervals of Genres

Our group then calculated the mean rating and 95% confidence interval of each genre, this time with Documentary as the control.

ANOVA_final <- dataset3 %>% 
  group_by(Final_Genre) %>% 
  dplyr::summarise(sample_mean = mean(Rating,na.rm = T), #remove NA values
            confidence_intervals = Confidence_Interval_calculation(Rating, .95)) %>% 
  mutate(population_mean_lower = sample_mean - confidence_intervals,
         population_mean_upper = sample_mean + confidence_intervals)
## `summarise()` ungrouping output (override with `.groups` argument)
ANOVA_final

Final_Genre sample_mean confidence_intervals population_mean_lower population_mean_upper
Documentary 7.3 0.0521 7.25 7.35
Drama 6.6 0.0233 6.58 6.63
Mystery 6.14 0.0831 6.05 6.22
Horror 5.33 0.0578 5.27 5.39
Action 6.06 0.0589 6 6.12
Comedy 6.15 0.0322 6.12 6.18
Crime 6.4 0.0543 6.35 6.45
Romance 6.36 0.0439 6.31 6.4
SciFi 5.8 0.113 5.69 5.91
Thriller 5.95 0.0565 5.9 6.01
Family 6.23 0.11 6.12 6.33
Adventure 6.17 0.0812 6.09 6.25
Biography 6.99 0.0673 6.92 7.06
Music 6.82 0.111 6.71 6.93
War 6.95 0.125 6.83 7.08
Animation 6.72 0.113 6.61 6.83
Fantasy 5.99 0.0977 5.89 6.09
History 6.88 0.103 6.78 6.98
Sport 6.59 0.142 6.45 6.73

From this table, our group observes that the 95% confidence interval of the Documentary mean does not overlap with the interval of any other genre, so its mean rating is significantly higher than the rest. This supports our earlier analysis.
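The half-width stored in confidence_intervals is typically a t-quantile multiplied by the standard error. The helper below is a hypothetical stand-in written for illustration; it is an assumption about what a function such as Confidence_Interval_calculation() might compute, not its actual definition.

```r
# Hypothetical helper: half-width of a t-based confidence interval
ci_half_width <- function(x, level = 0.95) {
  x <- x[!is.na(x)]                        # drop missing ratings
  n <- length(x)
  qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
}

# Simulated ratings for one genre (not the IMDb data)
set.seed(5)
ratings <- rnorm(100, mean = 7.3, sd = 1)

hw <- ci_half_width(ratings, 0.95)
c(lower = mean(ratings) - hw, upper = mean(ratings) + hw)

# Cross-check against the interval reported by t.test()
t.test(ratings)$conf.int
```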

3.6 Error bar visualisation of Genre and Sample mean

We then used an error bar visualisation to show the mean rating of each genre with its 95% confidence interval.

ANOVA_final_vis <- ANOVA_final %>% 
  ggplot(aes(x = Final_Genre, y = sample_mean)) +
  geom_point(size = 1.5) +
  geom_errorbar(aes(ymin = population_mean_lower,
                    ymax = population_mean_upper), width = 0.5, size = 0.8) +
  # Blue line: lower 95% bound of the Documentary mean
  geom_hline(yintercept = 7.25, color = "deepskyblue4", size = 1.5) +
  # Red line: average of the genre means
  geom_hline(aes(yintercept = mean(sample_mean)), color = "red", size = 1.5) +
  labs(x = "Genres",
       y = "Mean Rating of movies",
       title = "Mean Rating of Genres at 95% Confidence Interval",
       subtitle = "Ratings for Documentary are significantly better than those of the other Genres.\n7 genres: Drama, Biography, Music, War, Animation, History and Sport have significantly\n better ratings than the mean rating across all genres.",
       caption = "Source: IMDB Database") +
  coord_cartesian(ylim = c(5, 7.5)) +
  # theme_linedraw() is a complete theme and sets the overall look
  theme_linedraw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 90))

# ggplotly() comes from the plotly package
plotly::ggplotly(ANOVA_final_vis)

The blue horizontal line marks the lower bound of the 95% confidence interval for the Documentary mean (7.25). Every other genre's interval lies entirely below this line, so the mean rating of Documentary is significantly higher than that of the other genres, as explained earlier. The red horizontal line marks the average of the 19 genre means. Aside from Documentary, 7 genres have confidence intervals that lie entirely above this line: Drama, Biography, Music, War, Animation, History and Sport.
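The two readings of the plot can also be checked numerically. The sketch below re-enters the rounded values from the table in this section (so results are approximate) and tests which intervals clear each reference line; the data frame and variable names are illustrative only.

```r
# Rounded values copied from the confidence-interval table in this section
ci <- data.frame(
  genre = c("Documentary", "Drama", "Mystery", "Horror", "Action", "Comedy",
            "Crime", "Romance", "SciFi", "Thriller", "Family", "Adventure",
            "Biography", "Music", "War", "Animation", "Fantasy", "History", "Sport"),
  mean  = c(7.30, 6.60, 6.14, 5.33, 6.06, 6.15, 6.40, 6.36, 5.80, 5.95,
            6.23, 6.17, 6.99, 6.82, 6.95, 6.72, 5.99, 6.88, 6.59),
  lower = c(7.25, 6.58, 6.05, 5.27, 6.00, 6.12, 6.35, 6.31, 5.69, 5.90,
            6.12, 6.09, 6.92, 6.71, 6.83, 6.61, 5.89, 6.78, 6.45),
  upper = c(7.35, 6.63, 6.22, 5.39, 6.12, 6.18, 6.45, 6.40, 5.91, 6.01,
            6.33, 6.25, 7.06, 6.93, 7.08, 6.83, 6.09, 6.98, 6.73),
  stringsAsFactors = FALSE
)

grand_mean <- mean(ci$mean)                       # red line in the plot
doc_lower  <- ci$lower[ci$genre == "Documentary"] # blue line in the plot
others     <- ci[ci$genre != "Documentary", ]

# Does every other genre's interval sit below the Documentary interval?
doc_is_clear <- all(others$upper < doc_lower)

# Which other genres have intervals entirely above the average of genre means?
above_average <- others$genre[others$lower > grand_mean]

print(doc_is_clear)
print(above_average)
```

Crime is a useful borderline case: its mean (6.4) is just above the average of the genre means, but its interval (6.35 to 6.45) straddles the line, so it is not counted.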

3.7 Linear model assumption check

We then checked whether the linear model assumptions hold by running GVLMA on the regression model.

gvlma(lm_regression) 
## 
## Call:
## lm(formula = Rating ~ Final_Genre, data = dataset3)
## 
## Coefficients:
##          (Intercept)      Final_GenreDrama    Final_GenreMystery  
##               7.3011               -0.6965               -1.1642  
##    Final_GenreHorror     Final_GenreAction     Final_GenreComedy  
##              -1.9692               -1.2449               -1.1501  
##     Final_GenreCrime    Final_GenreRomance      Final_GenreSciFi  
##              -0.9018               -0.9434               -1.4997  
##  Final_GenreThriller     Final_GenreFamily  Final_GenreAdventure  
##              -1.3479               -1.0760               -1.1342  
## Final_GenreBiography      Final_GenreMusic        Final_GenreWar  
##              -0.3131               -0.4810               -0.3492  
## Final_GenreAnimation    Final_GenreFantasy    Final_GenreHistory  
##              -0.5811               -1.3115               -0.4211  
##     Final_GenreSport  
##              -0.7080  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = lm_regression) 
## 
##                                  Value       p-value                   Decision
## Global Stat        475.912162035394260 0.00000000000 Assumptions NOT satisfied!
## Skewness           443.597621964021528 0.00000000000 Assumptions NOT satisfied!
## Kurtosis            29.719976833749126 0.00000004992 Assumptions NOT satisfied!
## Link Function       -0.000000000009577 1.00000000000    Assumptions acceptable.
## Heteroscedasticity   2.594563237633192 0.10723099616    Assumptions acceptable.
autoplot(gvlma(lm_regression))
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

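From the results, our group noticed that the Global Stat, Skewness and Kurtosis checks all fail. Skewness accounts for most of the Global Stat (443.6 out of 475.9), so we addressed it first.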

We then plotted the density of the Rating variable to check whether it is left- or right-skewed.

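# Density of the ratings in dataset3, the data used to fit lm_regression
skewness_graph <- dataset3 %>% 
  ggplot(aes(x = Rating)) +
  geom_density(fill = "tomato3", alpha = 0.3)

skewness_graph # print the plot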

From the visualization, we noticed that the distribution of Rating is left-skewed, with a long tail towards low ratings.

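Since the distribution is left-skewed, our group applied a reflected square-root transformation, sqrt(max(Rating) + 1 - Rating), to the Rating variable and ran GVLMA again. Because the transformation reflects the scale, larger transformed values correspond to lower ratings, so the signs of the coefficients in the new model are reversed relative to the original one.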

original_data_4 <- dataset3
original_data_4$Rating <- sqrt(max(dataset3$Rating + 1) - dataset3$Rating)

lm_regression2 <- lm(Rating ~ Final_Genre, data = original_data_4)
summary(lm_regression2)
## 
## Call:
## lm(formula = Rating ~ Final_Genre, data = original_data_4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02033 -0.15199 -0.01079  0.15300  0.88817 
## 
## Coefficients:
##                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          1.719508   0.008527 201.648 < 0.0000000000000002 ***
## Final_GenreDrama     0.189396   0.009118  20.771 < 0.0000000000000002 ***
## Final_GenreMystery   0.305758   0.013219  23.130 < 0.0000000000000002 ***
## Final_GenreHorror    0.499099   0.011389  43.823 < 0.0000000000000002 ***
## Final_GenreAction    0.321518   0.010613  30.294 < 0.0000000000000002 ***
## Final_GenreComedy    0.303307   0.009432  32.158 < 0.0000000000000002 ***
## Final_GenreCrime     0.240639   0.010978  21.920 < 0.0000000000000002 ***
## Final_GenreRomance   0.253969   0.010526  24.128 < 0.0000000000000002 ***
## Final_GenreSciFi     0.385809   0.015297  25.221 < 0.0000000000000002 ***
## Final_GenreThriller  0.349629   0.010768  32.470 < 0.0000000000000002 ***
## Final_GenreFamily    0.280804   0.015053  18.654 < 0.0000000000000002 ***
## Final_GenreAdventure 0.294413   0.012191  24.149 < 0.0000000000000002 ***
## Final_GenreBiography 0.088972   0.013951   6.377   0.0000000001839750 ***
## Final_GenreMusic     0.130981   0.017256   7.591   0.0000000000000332 ***
## Final_GenreWar       0.095738   0.019592   4.887   0.0000010334410259 ***
## Final_GenreAnimation 0.157359   0.017204   9.146 < 0.0000000000000002 ***
## Final_GenreFantasy   0.341001   0.014176  24.055 < 0.0000000000000002 ***
## Final_GenreHistory   0.115830   0.016840   6.878   0.0000000000062354 ***
## Final_GenreSport     0.189592   0.019905   9.525 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2386 on 20234 degrees of freedom
## Multiple R-squared:  0.1495, Adjusted R-squared:  0.1487 
## F-statistic: 197.5 on 18 and 20234 DF,  p-value: < 0.00000000000000022
gvlma(lm_regression2)
## 
## Call:
## lm(formula = Rating ~ Final_Genre, data = original_data_4)
## 
## Coefficients:
##          (Intercept)      Final_GenreDrama    Final_GenreMystery  
##              1.71951               0.18940               0.30576  
##    Final_GenreHorror     Final_GenreAction     Final_GenreComedy  
##              0.49910               0.32152               0.30331  
##     Final_GenreCrime    Final_GenreRomance      Final_GenreSciFi  
##              0.24064               0.25397               0.38581  
##  Final_GenreThriller     Final_GenreFamily  Final_GenreAdventure  
##              0.34963               0.28080               0.29441  
## Final_GenreBiography      Final_GenreMusic        Final_GenreWar  
##              0.08897               0.13098               0.09574  
## Final_GenreAnimation    Final_GenreFantasy    Final_GenreHistory  
##              0.15736               0.34100               0.11583  
##     Final_GenreSport  
##              0.18959  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = lm_regression2) 
## 
##                                Value   p-value                   Decision
## Global Stat        17.98112018701902 0.0012446 Assumptions NOT satisfied!
## Skewness            1.23195064494689 0.2670280    Assumptions acceptable.
## Kurtosis           14.42910098639386 0.0001455 Assumptions NOT satisfied!
## Link Function      -0.00000000004859 1.0000000    Assumptions acceptable.
## Heteroscedasticity  2.32006855572685 0.1277144    Assumptions acceptable.

Upon running GVLMA analysis on the new model, we realised that while skewness is now fixed, the Global Stat and Kurtosis assumptions are still not satisfied.
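When reading the transformed model, it helps to remember that the reflection flips the direction of the scale. The sketch below uses a small made-up vector (not the IMDb ratings) to show that the transformation reverses the ordering and that it can be inverted to return to the original rating scale.

```r
# Toy left-skewed ratings (illustrative values only)
rating <- c(2.1, 4.8, 5.9, 6.4, 6.8, 7.0, 7.3, 7.5, 7.9, 8.4)

# Reflect around max + 1, then take the square root
k         <- max(rating) + 1
reflected <- sqrt(k - rating)

# The highest rating maps to the smallest transformed value
print(data.frame(rating, reflected = round(reflected, 3)))

# Back-transform to the original scale
back <- k - reflected^2
print(all.equal(back, rating))
```

This is why the positive coefficients in lm_regression2 (for example 0.499 for Horror) point in the same substantive direction as the negative coefficients in the original model: both say the genre is rated lower than Documentary.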

Thus, we tried using heteroskedasticity-robust (sandwich) standard errors to address the remaining kurtosis violation.

summ(lm_regression2, robust = "HC3", cluster = "firm" )
Observations 20253
Dependent variable Rating
Type OLS linear regression
F(18,20234) 197.54
R² 0.15
Adj. R² 0.15
Est. S.E. t val. p
(Intercept) 1.72 0.01 234.08 0.00
Final_GenreDrama 0.19 0.01 23.79 0.00
Final_GenreMystery 0.31 0.01 23.85 0.00
Final_GenreHorror 0.50 0.01 49.92 0.00
Final_GenreAction 0.32 0.01 30.84 0.00
Final_GenreComedy 0.30 0.01 36.18 0.00
Final_GenreCrime 0.24 0.01 23.69 0.00
Final_GenreRomance 0.25 0.01 27.40 0.00
Final_GenreSciFi 0.39 0.02 24.77 0.00
Final_GenreThriller 0.35 0.01 34.33 0.00
Final_GenreFamily 0.28 0.02 17.62 0.00
Final_GenreAdventure 0.29 0.01 23.49 0.00
Final_GenreBiography 0.09 0.01 7.44 0.00
Final_GenreMusic 0.13 0.02 7.90 0.00
Final_GenreWar 0.10 0.02 5.14 0.00
Final_GenreAnimation 0.16 0.02 9.39 0.00
Final_GenreFantasy 0.34 0.01 24.10 0.00
Final_GenreHistory 0.12 0.02 7.38 0.00
Final_GenreSport 0.19 0.02 9.35 0.00
Standard errors: Cluster-robust, type = HC3
gvlma(lm_regression2)
## 
## Call:
## lm(formula = Rating ~ Final_Genre, data = original_data_4)
## 
## Coefficients:
##          (Intercept)      Final_GenreDrama    Final_GenreMystery  
##              1.71951               0.18940               0.30576  
##    Final_GenreHorror     Final_GenreAction     Final_GenreComedy  
##              0.49910               0.32152               0.30331  
##     Final_GenreCrime    Final_GenreRomance      Final_GenreSciFi  
##              0.24064               0.25397               0.38581  
##  Final_GenreThriller     Final_GenreFamily  Final_GenreAdventure  
##              0.34963               0.28080               0.29441  
## Final_GenreBiography      Final_GenreMusic        Final_GenreWar  
##              0.08897               0.13098               0.09574  
## Final_GenreAnimation    Final_GenreFantasy    Final_GenreHistory  
##              0.15736               0.34100               0.11583  
##     Final_GenreSport  
##              0.18959  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = lm_regression2) 
## 
##                                Value   p-value                   Decision
## Global Stat        17.98112018701902 0.0012446 Assumptions NOT satisfied!
## Skewness            1.23195064494689 0.2670280    Assumptions acceptable.
## Kurtosis           14.42910098639386 0.0001455 Assumptions NOT satisfied!
## Link Function      -0.00000000004859 1.0000000    Assumptions acceptable.
## Heteroscedasticity  2.32006855572685 0.1277144    Assumptions acceptable.

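However, the GVLMA result is unchanged. This is expected: robust standard errors change only the standard errors and t values reported by summ(), not the fitted model or its residuals, and gvlma() is run on the same lm_regression2 object. As such, our group decided to treat the remaining kurtosis violation as one of the limitations of our analysis.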
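To make the point concrete, here is a minimal base-R sketch of HC3 sandwich standard errors on simulated data (the data and names are hypothetical). It computes the estimator by hand to show that only the variance estimate changes; in practice, packages such as sandwich provide this via `vcovHC(fit, type = "HC3")`.

```r
set.seed(1)

# Simulated data with skewed errors (not the IMDb data)
toy   <- data.frame(g = factor(rep(c("a", "b", "c"), each = 30)))
toy$y <- c(6, 5, 7)[toy$g] + (rexp(90) - 1)

fit <- lm(y ~ g, data = toy)

X <- model.matrix(fit)   # design matrix
e <- residuals(fit)      # OLS residuals
h <- hatvalues(fit)      # leverages

# HC3: (X'X)^-1 X' diag(e^2 / (1 - h)^2) X (X'X)^-1
bread  <- solve(crossprod(X))
meat   <- crossprod(X * (e / (1 - h)))
vc_hc3 <- bread %*% meat %*% bread

se_ols <- sqrt(diag(vcov(fit)))
se_hc3 <- sqrt(diag(vc_hc3))

# Same coefficients and residuals, different standard errors
print(cbind(estimate = coef(fit), se_ols, se_hc3))
```

Since the coefficients and residuals are identical under both sets of standard errors, any diagnostic based on the residuals, including the skewness and kurtosis checks, gives the same answer.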

Model 4: Multiple Linear Regression Analysis

Multiple Linear Regression is similar to Simple Linear Regression, except that it uses several explanatory variables to explain the response variable. In this case, aside from Genre, we add one more variable: Movie Duration. We chose Movie Duration because it had the highest correlation estimate in the earlier correlation results.

We decided to shortlist the Top 5 Genres within Final Genre for more robust analysis, based on their correlation estimates. These genres are Documentary, Drama, Biography, Horror and Thriller, all of which have very significant correlations with Ratings.

4.1 Multi Linear Regression without interaction

# Keep only the five shortlisted genres
top5_genres <- c("Documentary", "Drama", "Biography", "Horror", "Thriller")

original_data_test <- dataset %>% 
  filter(Final_Genre %in% top5_genres)

# Treat genre as a factor so lm() creates one dummy variable per genre
original_data_3 <- original_data_test %>% 
  mutate(Final_Genre = as_factor(Final_Genre))

multi_test <- lm(Rating ~ Final_Genre + Duration, data = original_data_3)

4.1.1 Assess multicollinearity using vifs argument

summ(multi_test, vifs = T)
Observations 9024
Dependent variable Rating
Type OLS linear regression
F(5,9018) 799.11
R² 0.31
Adj. R² 0.31
Est. S.E. t val. p VIF
(Intercept) 5.35 0.05 112.67 0.00 NA
Final_GenreHorror -1.12 0.03 -37.05 0.00 1.07
Final_GenreThriller -0.59 0.03 -22.29 0.00 1.07
Final_GenreBiography 0.32 0.04 7.64 0.00 1.07
Final_GenreDocumentary 0.86 0.03 25.85 0.00 1.07
Duration 0.01 0.00 27.21 0.00 1.07
Standard errors: OLS

From the table, we can see that the VIF values are all well below 4 (1.07). Hence, multicollinearity between Genre and Duration is not a concern.
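For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below computes it by hand for a continuous variable alongside a single factor, using simulated data with hypothetical names.

```r
set.seed(42)

# Simulated data: duration differs a little by genre (not the IMDb data)
toy <- data.frame(genre = factor(rep(c("Drama", "Horror", "Thriller"), each = 40)))
toy$duration <- rnorm(120,
                      mean = c(105, 92, 100)[as.integer(toy$genre)],
                      sd   = 15)

# How much of the variation in duration is explained by genre?
r2 <- summary(lm(duration ~ genre, data = toy))$r.squared

# Variance inflation factor for duration
vif_duration <- 1 / (1 - r2)
print(c(r_squared = r2, vif = vif_duration))
```

By the same formula, a VIF of 1.07 corresponds to an R² of about 0.065, that is, genre explaining roughly 6.5% of the variation in duration.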

4.1.2 Mean-centering variables using center argument

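We mean-centre Duration so that the intercept is interpretable. Without centring, the intercept (5.35) is the predicted rating of a Drama, the reference genre, with a duration of 0 minutes, which is not meaningful. After centring, the intercept (6.57) is the predicted rating of a Drama of average duration. The slope estimates are unchanged. Note that centring does not relax the linearity assumption: the model still predicts the same change in rating from 1 to 2 minutes as from 120 to 121 minutes.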

summ(multi_test, center = T)
Observations 9024
Dependent variable Rating
Type OLS linear regression
F(5,9018) 799.11
R² 0.31
Adj. R² 0.31
Est. S.E. t val. p
(Intercept) 6.57 0.01 560.38 0.00
Final_GenreHorror -1.12 0.03 -37.05 0.00
Final_GenreThriller -0.59 0.03 -22.29 0.00
Final_GenreBiography 0.32 0.04 7.64 0.00
Final_GenreDocumentary 0.86 0.03 25.85 0.00
Duration 0.01 0.00 27.21 0.00
Standard errors: OLS; Continuous predictors are mean-centered.
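The effect of mean-centring can be seen in a small simulated example (hypothetical data and names): the slope is identical in both fits, and only the intercept moves, from the prediction at 0 minutes to the prediction at the average duration.

```r
set.seed(7)

# Simulated ratings that rise slightly with duration (not the IMDb data)
toy        <- data.frame(duration = runif(100, 60, 180))
toy$rating <- 5 + 0.01 * toy$duration + rnorm(100, sd = 0.5)

# Uncentred fit: intercept is the prediction at duration = 0
raw <- lm(rating ~ duration, data = toy)

# Centred fit: intercept is the prediction at the mean duration
toy$duration_c <- toy$duration - mean(toy$duration)
centred        <- lm(rating ~ duration_c, data = toy)

print(rbind(raw = coef(raw), centred = coef(centred)))

# The centred intercept equals the raw prediction at the mean duration
shifted <- unname(coef(raw)[1] + coef(raw)[2] * mean(toy$duration))
print(all.equal(shifted, unname(coef(centred)[1])))
```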

4.1.3 Visualisation

First, our group ran an effect plot to visualize the predicted rating for each genre.

effect_plot(multi_test,
            pred = Final_Genre,
            interval = T,
            plot.points = T) +
  ylim(5,8) +
  annotate("text",
           x = 1,
           y = 7.7,
           label = "paste(italic(R)^2, \" = 0.31\")",
           size = 5,
           color = "tomato3",
           parse = T)
## Warning: Removed 1194 rows containing missing values (geom_point).

Since confidence intervals clearly did not overlap, our group concludes that the mean ratings of the top 5 Genres are significantly different from each other.

Second, our group ran effect plot again, this time to visualize durations as compared to Ratings.

effect_plot(multi_test,
            pred = Duration,
            interval = T,
            plot.points = T) +
  ylim(3,10) +
  xlim(0,300) +
  annotate("text",
           x = 8,
           y = 7.7,
           label = "paste(italic(R)^2, \" = 0.31\")",
           size = 5,
           color = "tomato3",
           parse = T)
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 38 row(s) containing missing values (geom_path).

From this regression line, we can clearly see that there is a positive relationship between Ratings and Duration.

Lastly, our group used summ with confint = T to tabulate the regression coefficients together with their 95% confidence intervals.

summ(multi_test, confint = T, digits = 4)
Observations 9024
Dependent variable Rating
Type OLS linear regression
F(5,9018) 799.1146
R² 0.3070
Adj. R² 0.3066
Est. 2.5% 97.5% t val. p
(Intercept) 5.3518 5.2587 5.4449 112.6707 0.0000
Final_GenreHorror -1.1166 -1.1757 -1.0575 -37.0451 0.0000
Final_GenreThriller -0.5904 -0.6423 -0.5385 -22.2852 0.0000
Final_GenreBiography 0.3173 0.2359 0.3987 7.6411 0.0000
Final_GenreDocumentary 0.8643 0.7988 0.9299 25.8527 0.0000
Duration 0.0115 0.0107 0.0123 27.2058 0.0000
Standard errors: OLS

From this table, we can read off the 95% confidence interval for the coefficient of each genre and of duration. Each genre coefficient is a difference from Drama, the reference genre. For example, holding duration constant, a Horror movie is expected to be rated between 1.06 and 1.18 points lower than a Drama.

The plot_summs visualization below shows the same coefficients together with the uncertainty around each estimate.

plot_summs(multi_test, scale = T, plot.distributions = T)

4.1.4 Checking for Linear Model Assumption

gvlma(multi_test)
## 
## Call:
## lm(formula = Rating ~ Final_Genre + Duration, data = original_data_3)
## 
## Coefficients:
##            (Intercept)       Final_GenreHorror     Final_GenreThriller  
##                 5.3518                 -1.1166                 -0.5904  
##   Final_GenreBiography  Final_GenreDocumentary                Duration  
##                 0.3173                  0.8643                  0.0115  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = multi_test) 
## 
##                        Value p-value                   Decision
## Global Stat        773.62876  0.0000 Assumptions NOT satisfied!
## Skewness           441.68955  0.0000 Assumptions NOT satisfied!
## Kurtosis           200.13394  0.0000 Assumptions NOT satisfied!
## Link Function      131.79032  0.0000 Assumptions NOT satisfied!
## Heteroscedasticity   0.01495  0.9027    Assumptions acceptable.
autoplot(gvlma(multi_test))

Once again, our group ran GVLMA to check the linear model assumptions. Unfortunately, the model fails the Global Stat, Skewness, Kurtosis and Link Function checks; only Heteroscedasticity is acceptable. This will be included as part of our limitations.

4.2 Multiple Linear Regression with interaction

Our Multiple Linear Regression without interaction investigates only the main effects of Genre and Movie Duration on Ratings. It assumes that the relationship between a given explanatory variable and the outcome does not depend on the other explanatory variable. However, this might not be true. For example, the Thriller Genre may be rated lower than the Documentary Genre only when the movie is very short, and may become more highly rated than the Documentary Genre when the movie is long.

Hence, we now want to look at the regression coefficients associated with the main and interaction effects of Genre and Movie Duration on Ratings.

multi_test_2 <- lm(Rating ~ (Final_Genre) * Duration, data = original_data_3)

4.2.1 Mean-centering variables using center argument.

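As in Section 4.1.2, we mean-centre Duration so that the intercept is the predicted rating of a Drama of average duration rather than of a 0-minute movie. With an interaction in the model, centring also changes the meaning of the genre coefficients: each one becomes the difference from Drama at the average duration, rather than at 0 minutes.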

summ(multi_test_2, center = T)
Observations 9024
Dependent variable Rating
Type OLS linear regression
F(9,9014) 477.10
R² 0.32
Adj. R² 0.32
Est. S.E. t val. p
(Intercept) 6.57 0.01 565.01 0.00
Final_GenreHorror -1.01 0.04 -28.84 0.00
Final_GenreThriller -0.57 0.03 -21.60 0.00
Final_GenreBiography 0.36 0.04 8.37 0.00
Final_GenreDocumentary 0.80 0.04 22.21 0.00
Duration 0.01 0.00 18.59 0.00
Final_GenreHorror:Duration 0.01 0.00 6.44 0.00
Final_GenreThriller:Duration 0.02 0.00 11.87 0.00
Final_GenreBiography:Duration -0.00 0.00 -2.35 0.02
Final_GenreDocumentary:Duration -0.00 0.00 -2.68 0.01
Standard errors: OLS; Continuous predictors are mean-centered.

4.2.2 Linear model assumption check

gvlma(multi_test_2)
## 
## Call:
## lm(formula = Rating ~ (Final_Genre) * Duration, data = original_data_3)
## 
## Coefficients:
##                     (Intercept)                Final_GenreHorror  
##                        5.546428                        -2.331447  
##             Final_GenreThriller             Final_GenreBiography  
##                       -2.225970                         0.728801  
##          Final_GenreDocumentary                         Duration  
##                        1.195690                         0.009712  
##      Final_GenreHorror:Duration     Final_GenreThriller:Duration  
##                        0.012483                         0.015688  
##   Final_GenreBiography:Duration  Final_GenreDocumentary:Duration  
##                       -0.003498                        -0.003788  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = multi_test_2) 
## 
##                       Value     p-value                   Decision
## Global Stat        757.2003 0.000000000 Assumptions NOT satisfied!
## Skewness           468.0608 0.000000000 Assumptions NOT satisfied!
## Kurtosis           268.1260 0.000000000 Assumptions NOT satisfied!
## Link Function       20.9010 0.000004836 Assumptions NOT satisfied!
## Heteroscedasticity   0.1125 0.737279658    Assumptions acceptable.
autoplot(gvlma(multi_test_2))

Once again, our group ran GVLMA to check the linear model assumptions. Unfortunately, this model also fails the Global Stat, Skewness, Kurtosis and Link Function checks; only Heteroscedasticity is acceptable. This will be included as part of our limitations.

4.3 Regression Models Comparison

Our group now wants to check whether the second regression model, with interaction, explains ratings significantly better than the one without.

export_summs(multi_test, multi_test_2,
             model.names = c("Main Effects", "With Interactions"),
             error_format = "[{conf.low}, {conf.high}], p = {p.value}",
             digits = 3)

                                 Main Effects                              With Interactions
(Intercept)                       5.352 *** [5.259, 5.445], p = 0.000       5.546 *** [5.433, 5.660], p = 0.000
Final_GenreHorror                -1.117 *** [-1.176, -1.057], p = 0.000    -2.331 *** [-2.702, -1.961], p = 0.000
Final_GenreThriller              -0.590 *** [-0.642, -0.538], p = 0.000    -2.226 *** [-2.502, -1.950], p = 0.000
Final_GenreBiography              0.317 *** [0.236, 0.399], p = 0.000       0.729 *** [0.387, 1.070], p = 0.000
Final_GenreDocumentary            0.864 *** [0.799, 0.930], p = 0.000       1.196 *** [0.920, 1.471], p = 0.000
Duration                          0.011 *** [0.011, 0.012], p = 0.000       0.010 *** [0.009, 0.011], p = 0.000
Final_GenreHorror:Duration                                                  0.012 *** [0.009, 0.016], p = 0.000
Final_GenreThriller:Duration                                                0.016 *** [0.013, 0.018], p = 0.000
Final_GenreBiography:Duration                                              -0.003 *   [-0.006, -0.001], p = 0.019
Final_GenreDocumentary:Duration                                            -0.004 **  [-0.007, -0.001], p = 0.007
N                                 9024                                      9024
R2                                0.307                                     0.323
*** p < 0.001; ** p < 0.01; * p < 0.05.

Adding the interaction between Genre and Duration increased the explanatory power of our model: R^2 rises from 0.307 to 0.323, an increase of 0.016.

Next, we wanted to check whether this increase in explanatory power is statistically significant.

Our group ran anova() to compare the two nested models.

anova(multi_test, multi_test_2)

Res.Df     RSS        Df   Sum of Sq   F    Pr(>F)
9.02e+03   6.67e+03
9.01e+03   6.52e+03   4    150         52   2.34e-43

Since the p-value is near 0 (2.34e-43), there is strong evidence that Model 2, which adds the interaction between Genre and Duration, explains ratings better than Model 1. We will now move on to probe the interaction effect.
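The F statistic reported by anova() for two nested models can be reproduced by hand from the residual sums of squares. The sketch below does this on simulated data (hypothetical names, not the IMDb data) and compares the manual result with anova().

```r
set.seed(3)

# Simulated data with a genuine group-by-slope interaction
n     <- 200
toy   <- data.frame(g = factor(rep(c("a", "b"), each = n / 2)),
                    x = runif(n, 0, 10))
toy$y <- 1 + 0.5 * toy$x +
  ifelse(toy$g == "b", -2 + 0.4 * toy$x, 0) + rnorm(n)

m1 <- lm(y ~ g + x, data = toy)   # main effects only
m2 <- lm(y ~ g * x, data = toy)   # main effects plus interaction

rss1    <- sum(residuals(m1)^2)
rss2    <- sum(residuals(m2)^2)
df_diff <- df.residual(m1) - df.residual(m2)

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_manual <- ((rss1 - rss2) / df_diff) / (rss2 / df.residual(m2))
p_manual <- pf(f_manual, df_diff, df.residual(m2), lower.tail = FALSE)

tab <- anova(m1, m2)
print(tab)
print(c(F = f_manual, p = p_manual))
```

Applying the same formula to the rounded values in the table above gives (150 / 4) / (6520 / 9014), or roughly 52, in line with the reported F.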

4.4 Probing Interaction Effect

There are 4 steps to analyse the interaction effect of Duration on Genre and Ratings.

Step 1: Plot interaction using interact_plot().

First, we ran an interact plot to see how Genre, as a moderating variable, affects the relationship between Rating and Duration. For this part of the analysis, we treated Genre, rather than Duration, as the moderating variable. This is because interact_plot() requires a continuous variable as its predictor (pred).

interact_plot(multi_test_2,
              pred = Duration, # X1 variable: predictor (continuous)
              modx = Final_Genre, # X2 variable: moderator (categorical)
              interval = T, # show confidence intervals
              int.width = 0.95, # width of the confidence interval
              vary.lty = T, # use a different line type for each genre
              line.thickness = 1,
              legend.main = "Movie Genre")

From the Interact Plot, it seems that there is a positive relationship between Ratings and Duration. This is seen for all 5 movie genres.

Step 2: Run simple slopes analysis

sim_slopes(multi_test_2, #Plug in your model
           pred = Duration,
           modx = Final_Genre,
           johnson_neyman = F)
## SIMPLE SLOPES ANALYSIS 
## 
## Slope of Duration when Final_Genre = Documentary: 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.01   0.00     4.50   0.00
## 
## Slope of Duration when Final_Genre = Biography: 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.01   0.00     4.47   0.00
## 
## Slope of Duration when Final_Genre = Thriller: 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.03   0.00    20.91   0.00
## 
## Slope of Duration when Final_Genre = Horror: 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.02   0.00    11.89   0.00
## 
## Slope of Duration when Final_Genre = Drama: 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.01   0.00    18.59   0.00

From the results, we found that the slope of Duration is positive and statistically significant (p < 0.01) in all 5 Genres. It is steepest for Thriller (0.03) and Horror (0.02), and about 0.01 for Drama, Biography and Documentary.
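The simple slopes follow directly from the model coefficients: the slope for the reference genre (Drama) is the Duration coefficient, and the slope for every other genre adds its interaction coefficient. The sketch below uses the uncentred coefficients printed in the gvlma(multi_test_2) output; the variable names are illustrative.

```r
# Coefficients copied from the gvlma(multi_test_2) output
b_duration    <- 0.009712
b_interaction <- c(Horror      =  0.012483,
                   Thriller    =  0.015688,
                   Biography   = -0.003498,
                   Documentary = -0.003788)

# Simple slope = main Duration slope + genre-specific interaction term
simple_slopes <- c(Drama = b_duration, b_duration + b_interaction)

print(round(simple_slopes, 4))
print(round(simple_slopes, 2))   # matches the Est. column of sim_slopes()
```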

Step 3: Spotlight analysis

# sim_slopes(multi_test_2, 
#          pred = Final_Genre,
#          modx = Duration,
#          johnson_neyman = T) 

We tried running a spotlight analysis. However, we were unable to do so because Genre is a categorical variable. As such, we will use the confidence intervals from our visualization below to make interpretations.

Step 4: Using confidence interval to show difference between genre.

interact_plot(multi_test_2,
              pred = "Duration", # X1 variable: predictor (continuous)
              modx = "Final_Genre", # X2 variable: moderator (categorical)
              interval = T, # show confidence intervals
              int.width = 0.95, # 95% confidence intervals
              vary.lty = T,
              line.thickness = 1,
              legend.main = "Movie Genre") +
  ylim(4,10) +
  xlim(50,250)+
  geom_vline(xintercept = 140, col = "red", linetype = 1, size = 1) +
  geom_vline(xintercept = 200, col = "blue", linetype = 1, size = 1) +
  labs(title = "The Interplay of Movie Genre and Duration on Ratings",
       subtitle = "For shorter movies below 140 minutes, Thriller and Horror Genres score worse in Movie Ratings than other genres.\nHowever, this trend reverses as duration increases: past 200 minutes, Thriller claims the top spot.",
       x = "Movie Duration", 
       y = "Movie Ratings",
       caption = "Source: IMDB Dataset") +
  annotate("text", 
           x =120, 
           y = 9,
           label = "The shaded areas denote 95% confidence intervals.\nThe vertical lines mark the boundaries\nbetween regions of significance and non-significance\nbased on confidence intervals") + 
  theme(legend.position = "top",
        text = element_text(family = "Courier"))
## Warning: Removed 270 row(s) containing missing values (geom_path).

The relationship between duration and ratings is similar across all 5 movie genres: in each case, Ratings rise with Duration. Thus, movie directors should aim for longer movies, as these seem to receive higher ratings in the 5 genres we have analysed here: Documentary, Drama, Horror, Thriller and Biography. This is especially true for Thriller and Horror, which show a steeper increase in ratings as duration increases.

For movies below 140 minutes, the top 3 genres appear to be Documentary, Biography and Drama. However, past 200 minutes, the top genre switches to Thriller, owing to its steeper positive relationship between ratings and duration.
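The durations at which the fitted lines cross can be worked out from the model coefficients. The sketch below rebuilds each genre's fitted line from the uncentred coefficients printed in the gvlma(multi_test_2) output and solves for the crossover points; `crossover()` is a hypothetical helper, and the results are point estimates that ignore the confidence bands.

```r
# Fitted line per genre: rating = intercept + slope * duration
# (coefficients copied from the gvlma(multi_test_2) output; Drama is the reference)
lines <- data.frame(
  genre     = c("Drama", "Horror", "Thriller", "Biography", "Documentary"),
  intercept = 5.546428 + c(0, -2.331447, -2.225970, 0.728801, 1.195690),
  slope     = 0.009712 + c(0,  0.012483,  0.015688, -0.003498, -0.003788),
  stringsAsFactors = FALSE
)

# Duration at which the fitted lines of genres a and b intersect
crossover <- function(a, b) {
  la <- lines[lines$genre == a, ]
  lb <- lines[lines$genre == b, ]
  (lb$intercept - la$intercept) / (la$slope - lb$slope)
}

thriller_vs_drama       <- crossover("Thriller", "Drama")
thriller_vs_biography   <- crossover("Thriller", "Biography")
thriller_vs_documentary <- crossover("Thriller", "Documentary")

print(round(c(Drama = thriller_vs_drama,
              Biography = thriller_vs_biography,
              Documentary = thriller_vs_documentary), 1))
```

On these point estimates, the Thriller line passes Drama at about 142 minutes and Documentary at about 176 minutes. The 200-minute line in the plot is read from the confidence bands, so it is later than the point-estimate crossover.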

4.5 Notched Boxplot of Interaction effect

Thriller and Horror ratings rise steeply as duration increases. To confirm the interaction effect of duration, we drew a notched boxplot with the mean and its confidence interval overlaid. This allows us to see how the ratings in our data change across genres as the duration increases.

original_data_3$reordered_Types_of_Duration <- factor(original_data_3$Types_of_Duration, levels = c("Very Short","Short", "Just Nice","Long", "Extremely Long"))

original_data_3 %>% 
  ggplot(aes(x = reorder(Final_Genre,Rating), y = Rating)) +
  geom_boxplot(aes(color = Final_Genre),
               notch = T) +
  # mean_cl_boot needs the Hmisc package; fun.args must be a list
  stat_summary(fun.data = "mean_cl_boot",
               geom = "errorbar",
               color = "purple",
               fun.args = list(conf.int = 0.95),
               size = 0.5) + 
  stat_summary(fun.data = "mean_cl_boot",
               geom = "pointrange",
               color = "purple",
               fun.args = list(conf.int = 0.95),
               size = 0.1) + 
  geom_jitter(aes(color = Final_Genre),
              alpha = 0.05) + 
  facet_wrap(.~reordered_Types_of_Duration, ncol = 5) + 
  #facet_wrap(vars(Types_of_Duration), ncol = 5L) + 
  labs(title = "Interaction effect confirmation using dataset",
       subtitle = "Thriller and Horror rating increase steeply as duration increases",
       caption = "Source: IMDB",
       x = "Genre Names",
       y = "Rating Scores") +
  theme_linedraw() + 
  theme(axis.text.x = element_text(angle = 20)) + 
  theme(legend.position = "none",
        axis.title = element_text())

Eureka! We found that, in our data, the mean rating of the Horror Genre increased from 5.4 to 6.3 as duration increased. For Thriller, the increase was from 5.5 to 6.9. Meanwhile, Documentary increased only from 7.3 to 7.7.

6. Interpretation of the Results

From our data science project, we draw the following two findings:

  1. From our ANOVA and Post-Hoc analysis, each of the 19 genres differs significantly in mean rating from at least 12 other genres. This is significant evidence that genre affects ratings. Of the 19 genres, Documentary and Horror are the 2 genres whose ratings are significantly different from those of ALL other genres.

  2. The relationship between duration and ratings is similar across all 5 movie genres: in each case, Ratings rise with Duration. Thus, movie directors should aim for longer movies, as these seem to receive higher ratings in the 5 genres we have analysed here: Documentary, Drama, Horror, Thriller and Biography. This is especially true for Thriller and Horror, which show a steeper increase in ratings as duration increases.

For movies below 140 minutes, the top 3 genres appear to be Documentary, Biography and Drama. However, past 200 minutes, the top genre switches to Thriller, owing to its steeper positive relationship between ratings and duration.

Going by the means reported in Section 4.5, Horror and Thriller ratings rose by about 17% (5.4 to 6.3) and 25.5% (5.5 to 6.9) respectively, while Documentary rose by only 5.5% (7.3 to 7.7). The increases in our data therefore match the pattern predicted by the visualisation of our multiple linear regression above. However, even though Horror and Thriller are predicted to be rated higher as duration increases, it must be noted that most movies do not exceed 200 minutes in length.

7. Further Exploration: Movies Maker Assistant R SHINY APP

To allow potential users (e.g. Directors, Production Company Executives) to further explore the characteristics of movies in different genres, our group has created an interactive segment using the Shiny package.

According to Raehsler (2013), there is a strong link between search volume and box office receipts. Hence, our group decided to examine this claim by providing the relative Google search volume (using Google Trends) of the movie title from 1 month before to 1 year after the release date of the movie.

Additionally, we used the httr package to make HTTP requests to the Movie Database API (hosted on RapidAPI) and retrieve movie descriptions based on the title input. We then generate word clouds from the descriptions and perform sentiment analysis, to help potential users understand how to create top-rated movies in each genre.

7.1 R SHINY APP

  1. Top 10 best and worst rated movies in the selected genre.
  2. Google Trends search volume visualisation for the top and worst movie in the selected genre.
  3. Movie Database API to retrieve description for the top 10 best and worst rated movies.
  4. Wordcloud of the descriptions in item 3; one for the group of best movies and one for the worst.
  5. Sentiment Analysis on the descriptions in item 3; one for the group of best movies and one for the worst.

You can run this SHINY interactive document hosted at the local server by running the RMarkdown file, OR navigate to our deployed application at SHINY_IMDB for a full online experience.

7.2 Implications of SHINY APP analysis

Based on the above application, it appears that the relative Google Trends search volume does not differ much between the top rated and the lowest rated movies. Furthermore, Google Trends only provides relative search volume over time, so it cannot show whether a movie attracted a large absolute number of searches.

Lastly, the sentiment analysis and the word cloud serve as additional guidance for movie directors and should not be seen as an exhaustive list of considerations.

8. Limitations and Future Directions

8.1 Limitations

1st limitation: For the purpose of this project, the team has used the official IMDB dataset, which has a limited set of variables. While there are more comprehensive datasets on Kaggle that are scraped from IMDB’s website, scraping is against IMDB’s terms of use. Hence, we have decided to use the official data provided by IMDB.

Given the limited dataset, our project can only analyse genre, the duration of the movie and the movie release year. However, the success of a movie (in terms of ratings) is dependent on many other factors such as movie budget, directors, cast etc.

2nd limitation: The dataset is not as robust as data from proper A/B testing or a designed survey. When conducting a survey, steps are taken to obtain a representative sample, and the same respondents answer all the questions. On IMDB, however, different users rate different movies. Thus, the ratings for each movie come from a different population, which results in heterogeneity of variance and causes the data to fail Levene’s test.
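
For readers unfamiliar with the test, the sketch below shows what failing Levene's test looks like on simulated data. It uses the median-centred (Brown-Forsythe) variant, written out in base R as a one-way ANOVA on absolute deviations from each group's median. The data and genre labels are simulated, and this is not necessarily the function call used earlier in this report.

```r
set.seed(42)  # simulated ratings, not IMDb data

# Two hypothetical genres with the same mean but different spreads.
ratings <- c(rnorm(200, mean = 6, sd = 0.5), rnorm(200, mean = 6, sd = 2))
genre   <- factor(rep(c("GenreA", "GenreB"), each = 200))

# Median-centred Levene test: one-way ANOVA on the absolute
# deviations of each rating from its own group's median.
abs_dev <- abs(ratings - ave(ratings, genre, FUN = median))
levene  <- anova(lm(abs_dev ~ genre))

p_value <- levene[["Pr(>F)"]][1]
print(levene)
print(p_value < 0.05)  # a small p-value indicates unequal variances
```

A small p-value rejects the hypothesis of equal variances across groups, which is the assumption that ANOVA relies on.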

8.2 Future Direction

As mentioned in limitation 1, using the official IMDB data only provides a small number of variables. Hence, future projects could analyse other important variables such as budget, revenue, cast and directors, age of IMDB voters and others to provide deeper insights on how to create a successful movie that can target different genres, demographics and age groups.

There is huge potential for directors to use data analytics to understand how to create a successful movie. Currently, our R SHINY application allows a user (e.g. a director) to sieve out elements from the top and bottom movies in each genre. The app can be further developed to include age, country, duration and specific themes of a movie using word clouds and many other visualisations. It could also be improved to allow directors to supply different inputs and analyse the best direction for making a successful movie. On top of analysing the factors that improve ratings, factors that increase the return on investment can also be added.

9. Contribution Statement

Every member contributed equally in their own ways, putting their best foot forward for this project. The contribution statement merely states who took charge of different aspects of this project, but each person was involved in the decision making and ideating process.

Anish: Modelling for ANOVA and post-hoc tests. Transforming graphs into interactive visualisations using ggplotly and igraph. Combined and stitched together the R Markdown files to create flow.

Bang: Creating the random generator for selecting genre during the tidy and transform process. Creating the R SHINY app that combined Google Trends and word clouds built from data web-scraped off Rotten Tomatoes.

Jim Meng: Cleaning, tidying and transforming the data from IMDB. Exploratory data analysis as well as tinkering with the R Markdown chunk options. Scraping Google Trends data to be used for the R SHINY app.

Joel: Exploratory data visualisation as well as the communications aspect of the project. Supporting the R SHINY and data scraping processes.

Yi Rui: Continuously brought together the problem statement, flow and dataset to achieve the aim of the project. Modelling for correlation and the various regressions.

Each member supported the other aspects of this project as and when needed while focusing on their individual strengths to achieve the final aim.

10. References

IMDb. (n.d.). IMDb Datasets. Retrieved from https://www.imdb.com/interfaces/

Hall, S., Neale, S., & Project Muse. (2010). Epics, Spectacles, and Blockbusters: A Hollywood History. Detroit, Mich: Wayne State University Press.

Raehsler, L. (2013). How People Search For Movies on Google Predicts Box Office Revenues. Retrieved on November 22, 2020 from https://www.searchenginewatch.com/2013/06/06/how-people-search-for-movies-on-google-predicts-box-office-revenues-study/


Information courtesy of
IMDb
(http://www.imdb.com).
Used with permission.


R SHINY app movie description and details, information courtesy of
Rotten Tomatoes
(https://www.rottentomatoes.com/).
Used by webscraping.


R SHINY app Google Trends, information courtesy of
Google Trends
(https://trends.google.com/).
Using gtrendsR package.